(57) Abstract: Distortion artifacts preceding a signal transient in an audio
signal stream processed by a transform-based low-bit-rate audio coding system
employing coding blocks are reduced by detecting a transient in the audio
signal stream and shifting the temporal relationship of the transient with
respect to the coding blocks such that the time duration of the distortion
artifacts is reduced. The audio data is time scaled in such a way that the
transients are temporally repositioned prior to quantization in a
transform-based low-bit-rate audio encoder so as to reduce the amount of
pre-noise in the decoded audio signal. Alternatively, or in addition, in a
transform-based low-bit-rate audio coding system, a transient in the audio
signal stream is detected and a portion of the distortion artifacts is time
compressed such that the time duration of the distortion artifacts is reduced.






WO 02/093560 A1

Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE,
CH, CY, DE, DK, ES, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR), OAPI patent
(BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).

Published:
- with international search report
- with amended claims

For two-letter codes and other abbreviations, refer to the "Guidance Notes on
Codes and Abbreviations" appearing at the beginning of each regular issue of
the PCT Gazette.






WO 02/093560                                                    PCT/US02/12957

DESCRIPTION 

Improving Transient Performance of Low Bit Rate Audio Coding
Systems by Reducing Pre-Noise

TECHNICAL FIELD 

The invention relates generally to high-quality, low bit rate digital transform
encoding and decoding of information representing audio signals such as music or
voice signals. More particularly, the invention relates to the reduction of distortion
artifacts preceding a signal transient ("pre-noise") in an audio signal stream produced
by such an encoding and decoding system.

BACKGROUND ART

Time Scaling 

Time scaling refers to altering the time evolution or duration of an audio signal
while not altering its spectral content (perceived timbre) or perceived pitch (where
pitch is a characteristic associated with periodic audio signals). Pitch scaling refers
to modifying the spectral content or perceived pitch of an audio signal while not
affecting its time evolution or duration. Time scaling and pitch scaling are dual
methods of one another. For example, a digitized audio signal's pitch may be
increased 5% without affecting its time duration by time scaling it by 5% (i.e.,
increasing the time duration of the signal) and then reading out the samples at a 5%
higher sample rate (e.g., by resampling), thereby maintaining its original time
duration. The resulting signal has the same time duration as the original signal but
with modified pitch or spectral characteristics. Resampling is not an essential step of
time scaling or pitch scaling unless it is desired to maintain a constant output
sampling rate or to maintain the input and output sampling rates the same.
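
The 5% example above can be checked with simple bookkeeping. The following is a
minimal sketch of the arithmetic only; it does not perform any time scaling, and
the function name and parameters are illustrative assumptions rather than
anything defined in this document.

```python
def pitch_shift_plan(num_samples, sample_rate_hz, pitch_ratio):
    """Bookkeeping for pitch scaling done as time scaling plus faster read-out.

    Hypothetical helper: it only tracks sample counts and rates.
    """
    # Step 1: time scale (lengthen) the signal by the pitch ratio.
    stretched_samples = round(num_samples * pitch_ratio)
    # Step 2: read the longer signal out at a proportionally higher rate.
    readout_rate_hz = sample_rate_hz * pitch_ratio
    # The two steps cancel in duration, leaving only the pitch change.
    duration_s = stretched_samples / readout_rate_hz
    return stretched_samples, readout_rate_hz, duration_s


# One second of audio at 48 kHz, pitch raised by 5%.
stretched, rate, duration = pitch_shift_plan(48000, 48000.0, 1.05)
```

Under these assumptions the stretched signal holds 50400 samples, and reading it
out at the 5% higher rate restores a duration of one second.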

In aspects of the present invention, time scaling processing of audio streams is
employed. However, as mentioned above, time scaling may also be performed using
pitch-scaling techniques, as they are duals of one another. Thus, while the term
"time scaling" is used herein, techniques that employ pitch scaling to achieve time
scaling may also be employed.

Low Bit Rate Audio Coding 
There is considerable interest among those in the field of signal processing to
minimize the amount of information required to represent a signal without
perceptible loss in signal quality. By reducing information requirements, signals
impose lower information capacity requirements upon communication channels and
storage media. With respect to digital coding techniques, minimal informational
requirements are synonymous with minimal binary bit requirements.

Some prior art techniques for coding audio signals intended for human hearing
attempt to reduce information requirements without producing any audible
degradation by exploiting psychoacoustic effects. The human ear displays
frequency-analysis properties resembling those of highly asymmetrical tuned filters
having variable center frequencies. The ability of the human ear to detect distinct
tones generally increases as the difference in frequency between the tones increases;
however, the ear's resolving ability remains substantially constant for frequency
differences less than the bandwidth of the above mentioned filters. Thus, the
frequency-resolving ability of the human ear varies according to the bandwidth of
these filters throughout the audio spectrum. The effective bandwidth of such an
auditory filter is referred to as a critical band. A dominant signal within a critical
band is more likely to mask the audibility of other signals anywhere within that
critical band than other signals at frequencies outside that critical band. A dominant
signal may mask other signals occurring not only at the same time as the masking
signal, but also occurring before and after the masking signal. The duration of pre-
and post-masking effects within a critical band depends upon the magnitude of the
masking signal, but pre-masking effects are usually of much shorter duration than
post-masking effects. See generally, the Audio Engineering Handbook, K. Blair
Benson ed., McGraw-Hill, San Francisco, 1988, pages 1.40-1.42 and 4.8-4.10.




Signal recording and transmitting techniques that divide the useful signal
bandwidth into frequency bands with bandwidths approximating the ear's critical
bands can better exploit psychoacoustic effects than wider band techniques.
Techniques that exploit psychoacoustic masking effects can encode and reproduce a
signal that is indistinguishable from the original input signal using a bit rate below
that required by PCM coding.

Critical band techniques comprise dividing the signal bandwidth into
frequency bands, processing the signal in each frequency band, and reconstructing a
replica of the original signal from the processed signal in each frequency band. Two
such techniques are sub-band coding and transform coding. Sub-band and transform
coders can reduce transmitted informational requirements in particular frequency
bands where the resulting coding inaccuracy (noise) is psychoacoustically masked by
neighboring spectral components without degrading the subjective quality of the
encoded signal.

A bank of digital bandpass filters may implement sub-band coding. Transform
coding may be implemented by any of several time-domain to frequency-domain
discrete transforms that implement a bank of digital bandpass filters. The remaining
discussion relates more particularly to transform coders; therefore the term
"sub-band" is used here to refer to selected portions of the total signal bandwidth,
whether implemented by a sub-band coder or a transform coder. A sub-band as
implemented by a transform coder is defined by a set of one or more adjacent
transform coefficients; hence, the sub-band bandwidth is a multiple of the transform
coefficient bandwidth. The bandwidth of a transform coefficient is proportional to
the input signal sampling rate and inversely proportional to the number of
coefficients generated by the transform to represent the input signal.
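
The proportionality stated above can be made concrete with a small sketch. It
assumes, as an illustration not stated in this document, that the N coefficients
of a real-input transform evenly span the range from 0 Hz to half the sampling
rate, which gives a constant of proportionality of one half. The function name
is hypothetical.

```python
def coefficient_bandwidth_hz(sample_rate_hz, num_coefficients):
    """Approximate bandwidth of one transform coefficient.

    Assumption: num_coefficients evenly cover 0 .. sample_rate_hz / 2.
    """
    return sample_rate_hz / (2.0 * num_coefficients)


# Proportional to sampling rate, inversely proportional to coefficient count.
narrow = coefficient_bandwidth_hz(48000.0, 1024)  # longer transform
wide = coefficient_bandwidth_hz(48000.0, 256)     # shorter transform
```

With these numbers the 1024-coefficient transform resolves about 23.4 Hz per
coefficient and the 256-coefficient transform about 93.8 Hz.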

Psychoacoustic masking may be more easily accomplished by transform
coders if the sub-band bandwidth throughout the audible spectrum is about half the
critical bandwidth of the human ear in the same portions of the spectrum. This is
because the critical bands of the human ear have variable center frequencies that
adapt to auditory stimuli, whereas sub-band and transform coders typically have
fixed sub-band center frequencies. To optimize the utilization of psychoacoustic-
masking effects, any distortion artifacts resulting from the presence of a dominant
signal should be limited to the sub-band containing the dominant signal. If the
sub-band bandwidth is about half or less than half of the critical band and if filter
selectivity is sufficiently high, effective masking of the undesired distortion products
is likely to occur even for signals whose frequency is near the edge of the sub-band
passband bandwidth. If the sub-band bandwidth is more than half a critical band,
there is a possibility that the dominant signal may cause the ear's critical band to be
offset from the coder's sub-band such that some of the undesired distortion products
outside the ear's critical bandwidth are not masked. This effect is most objectionable
at low frequencies where the ear's critical band is narrower.

The probability that a dominant signal may cause the ear's critical band to
offset from a coder sub-band and thereby "uncover" other signals in the same coder
sub-band is generally greater at low frequencies where the ear's critical band is
narrower. In transform coders, the narrowest possible sub-band is one transform
coefficient; therefore psychoacoustic masking may be more easily accomplished if
the transform coefficient bandwidth does not exceed one half the bandwidth of the
ear's narrowest critical band. Increasing the length of the transform may decrease
the transform coefficient bandwidth. One disadvantage of increasing the length of
the transform is an increase in the processing complexity to compute the transform
and to encode larger numbers of narrower sub-bands. Other disadvantages are
discussed below.

Of course, psychoacoustic masking may be achieved using wider sub-bands if
the center frequency of these sub-bands can be shifted to follow dominant signal
components in much the same way the ear's critical band center frequency shifts.

The ability of a transform coder to exploit psychoacoustic masking effects also
depends upon the selectivity of the filter bank implemented by the transform. Filter
"selectivity," as that term is used here, refers to two characteristics of sub-band
bandpass filters. The first is the bandwidth of the regions between the filter pass-
band and stopbands (the width of the transition bands). The second is the attenuation
level in the stopbands. Thus, filter selectivity refers to the steepness of the filter
response curve within the transition bands (steepness of transition band rolloff), and
the level of attenuation in the stopbands (depth of stopband rejection).

Filter selectivity is directly affected by numerous factors including the three
factors discussed below: block length, window weighting functions, and transforms.
In a very general sense, block length affects coder temporal and frequency resolution,
and windows and transforms affect coding gain.

Low Bit Rate Audio Coding / Block Length

The input signal to be encoded is sampled and segmented into "signal sample
blocks" prior to sub-band filtering. The number of samples in the signal sample
block is the signal sample block length.

It is common for the number of coefficients generated by a transform filter
bank (the transform length) to be equal to the signal sample block length, but this is
not necessary. An overlapping-block transform may be used and is sometimes
described in the art as a transform of length N that transforms signal sample blocks
with 2N samples. This transform can also be described as a transform of length 2N
that generates only N unique coefficients. Because all the transforms discussed here
can be thought to have lengths equal to the signal sample block length, the two
lengths are generally used here as synonyms for one another.

The signal sample block length affects the temporal and frequency resolution
of a transform coder. Transform coders using shorter block lengths have poorer
frequency resolution because the discrete transform coefficient bandwidth is wider
and filter selectivity is lower (decreased rate of transition band rolloff and a reduced
level of stopband rejection). This degradation in filter performance causes the energy
of a single spectral component to spread into neighboring transform coefficients.
This undesirable spreading of spectral energy is the result of degraded filter
performance called "sidelobe leakage."

Transform coders using longer block lengths have poorer temporal resolution
because quantization errors cause a transform encoder/decoder system to "smear" the
frequency components of a sampled signal across the full length of the signal sample
block. Distortion artifacts in the signal recovered from the inverse transform are
most audible as a result of large changes in signal amplitude that occur during a time
interval much shorter than the signal sample block length. Such amplitude changes
are referred to here as "transients." Such distortion manifests itself as noise in the
form of an echo or ringing just before (pre-transient noise or "pre-noise") and just
after (post-transient noise) the transient. Pre-noise is of particular concern because it
is highly audible and, unlike post-transient noise, only minimally masked (a transient
provides only minimal temporal pre-masking). Pre-noise is produced when the high
frequency components of transient audio material are temporally smeared through
the length of the audio coder block in which it occurs. The present invention is
concerned with minimizing pre-noise. Post-transient noise typically is substantially
masked and is not the subject of the present invention.
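
The dependence of pre-noise duration on block length can be illustrated
numerically. The sketch below uses a deliberately simplified model that is an
assumption, not something this document specifies: blocks do not overlap, and
the pre-noise extends from the start of the block containing the transient up to
the transient itself. The function name and sample values are hypothetical.

```python
def pre_noise_ms(transient_index, block_len, sample_rate_hz):
    """Pre-noise duration in milliseconds under a simplified block model.

    Assumption: non-overlapping blocks, with smearing that reaches back to the
    start of the block that contains the transient.
    """
    offset = transient_index % block_len  # samples from block start to transient
    return 1000.0 * offset / sample_rate_hz


# The same transient, coded with a long block and with a short block at 48 kHz.
long_block = pre_noise_ms(1800, 2048, 48000.0)
short_block = pre_noise_ms(1800, 256, 48000.0)
```

In this model the long block leaves 37.5 ms of pre-noise before the transient,
while the short block leaves well under 1 ms.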

Fixed block length transform coders use a compromise block length that trades
off temporal resolution against frequency resolution. A short block length degrades
sub-band filter selectivity, which may result in a nominal passband filter bandwidth
that exceeds the ear's critical bandwidth at lower, or at all, frequencies. Even if the
nominal sub-band bandwidth is narrower than the ear's critical bandwidth, degraded
filter characteristics manifested as a broad transition band and/or poor stopband
rejection may result in significant signal artifacts outside the ear's critical bandwidth.
On the other hand, a long block length may improve filter selectivity but reduces
temporal resolution, which may result in audible signal distortion occurring outside
the ear's temporal psychoacoustic masking interval.

Window Weighting Function 
Discrete transforms do not produce a perfectly accurate set of frequency
coefficients because they work with only a finite-length segment of the signal, the
signal sample block. Strictly speaking, discrete transforms produce a time-frequency
representation of the input time-domain signal rather than a true frequency-domain
representation which would require infinite signal sample block lengths. For
convenience of discussion here, however, the output of discrete transforms is referred
to as a frequency-domain representation. In effect, the discrete transform assumes
that the sampled signal only has frequency components whose periods are a
submultiple of the signal sample block length. This is equivalent to an assumption
that the finite-length signal is periodic. The assumption in general, of course, is not
true. The assumed periodicity creates discontinuities at the edges of the signal
sample block that cause the transform to create phantom spectral components.

One technique that minimizes this effect is to reduce the discontinuity prior to
the transformation by weighting the signal samples such that samples near the edges
of the signal sample block are zero or close to zero. Samples at the center of the
signal sample block are generally passed unchanged, i.e., weighted by a factor of
one. This weighting function is called an "analysis window." The shape of the
window directly affects filter selectivity.

As used here, the term "analysis window" refers only to the windowing
function performed prior to application of the forward transform. The analysis
window is a time-domain function. If no compensation for the window's effects is
provided, the recovered or "synthesized" signal is distorted according to the shape of
the analysis window. One compensation method known as overlap-add is well
known in the art. This method requires the coder to transform overlapped blocks of
input signal samples. By carefully designing the analysis window such that two
adjacent windows add to unity across the overlap, the effects of the window are
exactly compensated.

Window shape affects filter selectivity significantly. See generally, Harris,
"On the Use of Windows for Harmonic Analysis with the Discrete Fourier
Transform," Proc. IEEE, vol. 66, January 1978, pp. 51-83. As a general rule,
"smoother" shaped windows and larger overlap intervals provide better selectivity.
For example, a Kaiser-Bessel window generally provides for greater filter selectivity
than a sine-tapered rectangular window.

When used with certain types of transforms such as the Discrete Fourier
Transform (DFT), overlap-add increases the number of bits required to represent the
signal because the portion of the signal in the overlap interval must be transformed
and transmitted twice, once for each of the two overlapped signal sample blocks.
Signal analysis/synthesis for systems using such a transform with overlap-add is not
critically sampled. The term "critically sampled" refers to a signal analysis/synthesis
which over a period of time generates the same number of frequency coefficients as
the number of input signal samples it receives. Hence, for noncritically sampled
systems, it is desirable to design the window with an overlap interval as small as
possible in order to minimize the coded signal information requirements.

Some transforms also require that the synthesized output from the inverse
transform be windowed. The synthesis window is used to shape each synthesized
signal block. Therefore, the synthesized signal is weighted by both an analysis and a
synthesis window. This two-step weighting is mathematically similar to weighting
the original signal once by a window whose shape is equal to a sample-by-sample
product of the analysis and synthesis windows. Therefore, in order to utilize overlap-
add to compensate for windowing distortion, both windows must be designed such
that the product of the two sums to unity across the overlap-add interval.
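
The unity-sum condition on the window product can be verified numerically. The
sketch below is one illustrative instance, not a window prescribed by this
document: it assumes a sine window used for both analysis and synthesis and a
50% overlap between adjacent blocks. The function names are hypothetical.

```python
import math


def sine_window(n):
    """Sine window of even length n (assumed for both analysis and synthesis)."""
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]


def overlap_sums(analysis, synthesis):
    """Sum the analysis*synthesis product where two half-overlapped blocks meet."""
    half = len(analysis) // 2
    product = [a * s for a, s in zip(analysis, synthesis)]
    # The second half of one block overlaps the first half of the next block.
    return [product[i + half] + product[i] for i in range(half)]


window = sine_window(16)
sums = overlap_sums(window, window)
```

Each entry of `sums` equals one to within rounding, because the product of the
two sine windows is a sine-squared shape and the overlapping halves contribute
sine-squared and cosine-squared terms of the same angle.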

While there is no single criterion that may be used to assess a window's
optimality, a window is generally considered "good" if the selectivity of the filter
used with the window is considered "good." Therefore, a well designed analysis
window (for transforms that use only an analysis window) or analysis/synthesis
window pair (for transforms that use both an analysis and a synthesis window) can
reduce sidelobe leakage.

Block Switching 

A common solution that addresses the compromise between temporal and
frequency resolution in fixed block length transform coders is the use of transient
detection and block length switching. In this solution the presence and location of
audio signal transients are detected using various transient detection methods. When
transient audio signals are detected that are likely to introduce pre-noise when coded
using a long audio coder block length, the low bit rate coder switches from the more
efficient long block length to a less efficient shorter block length. While this reduces
the frequency resolution and coding efficiency of the encoded audio signal it also
reduces the length of transient pre-noise introduced by the coding process, improving
the perceived quality of the audio upon low bit rate decoding. Techniques for block
length switching are disclosed in U.S. Patents 5,394,473; 5,848,391; and 6,226,608
B1, each of which is hereby incorporated by reference in its entirety. Although the
present invention reduces pre-noise without the complexity and disadvantages of
block switching, it may be employed along with and in addition to block switching.
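
The block switching decision described above can be sketched as follows. This is
a minimal illustration under stated assumptions and is not the detection method
of the cited patents: the transient detector is a simple energy-ratio test
between consecutive segments, and the segment count, threshold, and block
lengths are hypothetical values chosen for the example.

```python
def detect_transient(block, num_segments=8, ratio_threshold=10.0):
    """Return the sample index where segment energy jumps sharply, else None.

    Assumption: a transient is flagged when one segment has more than
    ratio_threshold times the energy of the segment before it.
    """
    seg_len = len(block) // num_segments
    energies = [
        sum(x * x for x in block[k * seg_len:(k + 1) * seg_len])
        for k in range(num_segments)
    ]
    for k in range(1, num_segments):
        previous = max(energies[k - 1], 1e-12)  # avoid division by zero
        if energies[k] / previous > ratio_threshold:
            return k * seg_len
    return None


def choose_block_length(block, long_len=2048, short_len=256):
    """Use the short block only when a transient is found in the block."""
    return short_len if detect_transient(block) is not None else long_len


quiet_then_loud = [0.001] * 1024 + [1.0] * 1024  # abrupt onset at sample 1024
steady = [0.5] * 2048                            # no onset
```

Here `choose_block_length(quiet_then_loud)` selects the short block and
`choose_block_length(steady)` keeps the long block.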

DISCLOSURE OF THE INVENTION 
In accordance with a first aspect of the present invention, a method for
reducing distortion artifacts preceding a signal transient in an audio signal stream
processed by a transform-based low-bit-rate audio coding system employing coding
blocks comprises detecting a transient in the audio signal stream, and shifting the
temporal relationship of the transient with respect to the coding blocks such that the
time duration of the distortion artifacts is reduced.

An audio signal is analyzed and the locations of transient signals are identified.
The audio data is then time scaled in such a way that the transients are temporally
repositioned prior to quantization in a transform-based low-bit-rate audio encoder so
as to reduce the amount of pre-noise in the decoded audio signal. Such processing
prior to encoding and decoding is referred to herein as "pre-processing."

Thus, before quantization in the encoder, because the quantization process
smears the transient throughout the encoding block creating the undesired pre-noise
artifacts, the transient is shifted to a better position vis-a-vis block ends using time
scaling (time compression or time expansion). Such pre-processing may also be
referred to as "transient time shifting". Transient time shifting requires the
identification of transients and also requires information as to their temporal location
relative to block ends. In principle, transient time shifting may be accomplished in
the time domain prior to application of the forward transform or in the frequency
domain following application of the forward transform but prior to quantization. In
practice, transient time shifting may be more easily accomplished in the time domain
prior to application of the forward transform, particularly when a compensating time
scaling is performed as described below.
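
The planning step of transient time shifting, deciding how far and in which
direction to move a transient relative to the block ends, can be sketched as
below. The sketch rests on assumptions not fixed by the passage above: blocks
are treated as non-overlapping, the "better position" is taken to be a
hypothetical `target_offset` just after a block start, and the smaller of the
two possible moves is preferred.

```python
def plan_transient_shift(transient_index, block_len, target_offset=0):
    """Return a signed shift in samples for a detected transient.

    Negative: time compress the audio before the transient (move it earlier).
    Positive: time expand the audio before the transient (move it later).
    Assumption: the desired position is target_offset samples after a block
    start, and the shorter move is chosen.
    """
    offset = transient_index % block_len
    move_earlier = (offset - target_offset) % block_len
    move_later = (target_offset - offset) % block_len
    if move_earlier <= move_later:
        return -move_earlier
    return move_later


# 2048-sample blocks: one transient mid-block, one near the end of a block.
shift_a = plan_transient_shift(5000, 2048)
shift_b = plan_transient_shift(6000, 2048)
```

For the first transient the plan is to compress by 904 samples so that it lands
at the start of its own block; for the second, expanding by 144 samples moves it
to the start of the next block.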

The results of transient time shifting may be audible because both the transient
and the audio stream are no longer in their original relative temporal positions; the
time evolution of the audio stream is altered as a result of time compression or time
expansion of the audio stream before the transient. A listener may perceive this as an
alteration in the rhythm within a musical piece, for example.

There are several compensation techniques for reducing such an alteration in
the audio stream's time evolution that form aspects of the present invention. These
compensation techniques are optional because slight variations in the temporal
evolution of an audio signal are not discernible to most listeners. Compensation
techniques are discussed after the following discussion of a second aspect of the
present invention.

In accordance with a second aspect of the present invention, in an encoder of a
transform-based low-bit-rate audio coding system employing coding blocks, a
method for reducing distortion artifacts preceding a signal transient in an audio signal
stream subsequent to inverse transformation, comprises detecting a transient in the
audio signal stream, and time compressing at least a portion of the distortion artifacts
such that the time duration of the distortion artifacts is reduced.

By such processing, referred to as "post-processing" herein, audio quality
improvements to any audio signal that has undergone low bit rate audio encoding
may be obtained whether or not pre-processing is employed and, if it is employed,
whether or not the encoder transmits metadata useful for the post-processing. Any
audio signal that has undergone low bit rate audio encoding and decoding may be
analyzed to identify the location of transient signals and to estimate the duration of
transient pre-noise artifacts. Then, time scale post-processing may be performed on
the audio so as to remove the transient signal pre-noise or reduce its duration.

As mentioned above, there are several compensation techniques for reducing
alterations in the audio stream's time evolution. These time scaling compensation
techniques also have the beneficial result of keeping the number of audio samples
constant.

A first time scaling compensation technique, useful in connection with pre-
processing, is applied before the forward transform. It applies a compensating time
scaling to the audio stream following the transient, the time scaling having a sense
opposite to the sense of the time scaling employed to shift the transient position and,
preferably, having substantially the same duration as the transient-shifting time
scaling. For convenience in discussion, this type of compensation is referred to
herein as "sample number compensation" because it is capable of keeping the
number of audio samples constant but is not capable of fully restoring the original
temporal evolution of the audio signal stream (it leaves the transient and portions of
the signal stream near the transient out of place temporally). Preferably, the time-
scaling providing sample number compensation closely follows the transient such
that it is temporally post-masked by the transient.
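
Sample number compensation can be sketched as a pair of opposite length changes
on either side of the transient. The code below is a minimal illustration under
loud assumptions: linear-interpolation resampling is swapped in as a crude
stand-in for a real time-scaling algorithm (unlike true time scaling, it also
alters pitch), and the function names, `region_len`, and `transient_len` are
hypothetical.

```python
def resample_linear(segment, new_len):
    """Stand-in for time scaling: linear interpolation to new_len samples."""
    if new_len <= 0 or not segment:
        return []
    if new_len == 1 or len(segment) == 1:
        return [segment[0]] * new_len
    step = (len(segment) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = min(int(pos), len(segment) - 2)
        frac = pos - lo
        out.append(segment[lo] * (1.0 - frac) + segment[lo + 1] * frac)
    return out


def shift_with_compensation(signal, transient_index, shift, region_len,
                            transient_len):
    """Shift a transient by `shift` samples while keeping the sample count.

    shift < 0 compresses the region before the transient (moves it earlier);
    the region right after the transient is scaled in the opposite sense.
    """
    pre_start = transient_index - region_len
    post_start = transient_index + transient_len
    post_end = post_start + region_len
    if pre_start < 0 or post_end > len(signal) or abs(shift) >= region_len:
        raise ValueError("regions do not fit inside the signal")
    pre = resample_linear(signal[pre_start:transient_index], region_len + shift)
    transient = signal[transient_index:post_start]  # left untouched
    post = resample_linear(signal[post_start:post_end], region_len - shift)
    return signal[:pre_start] + pre + transient + post + signal[post_end:]


impulse = [0.0] * 1000
impulse[500] = 1.0
shifted = shift_with_compensation(impulse, 500, -20, 100, 10)
```

The output has the same number of samples as the input, the transient now sits
20 samples earlier, and the audio after the compensating region is back at its
original position.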

Although sample number compensation leaves the transient shifted from its
original temporal position, it does restore the audio stream following the
compensating time scaling to its original relative temporal position. Thus, the
likelihood of audibility of the transient time shifting is reduced, although it is not
eliminated, because the transient is still out of its original position. Nevertheless, this
may provide a sufficient reduction in audibility and it has the advantage that it is
done prior to low bit-rate audio encoding, allowing the use of a standard, unmodified
decoder. As explained below, a full restoration of the audio signal stream's time
evolution can only be accomplished by processing in the decoder or following the
decoder. In addition to reducing the possibility of audibility of the transient time
shifting, time-scaling compensation before forward transformation has the advantage
of keeping the number of audio samples constant, which may be important for
processing and/or for the operation of hardware implementing the processing.

In order to provide optimum time-scaling compensation before forward
transformation, information as to the location of the transient and the temporal length
of the transient time shifting should be employed by the compensation process.

If transient time shifting is applied after blocking (but before applying the
forward transform), it is necessary to employ sample number compensation within
the same block in which transient time shifting is done in order to keep the block
length the same. Consequently, it is preferred to perform transient time shifting and
sample number compensation before blocking.

Sample number compensation may also be employed after the inverse
transform (either in the decoder or after decoding) in connection with post-
processing. In this case, information useful for performing compensation may be
sent to the compensation process from the decoder (which information may have
originated in the encoder and/or the decoder).

A more complete restoration of the audio signal stream's temporal evolution
along with restoring the original number of audio samples may be accomplished after
the inverse transform (either in the decoder or following decoding), by applying a
compensating time scaling to the audio stream before the transient in the sense
opposite to the sense of the time scaling employed to shift the transient position and,
preferably, of substantially the same duration as the transient-shifting time scaling.
For convenience in discussion, this type of compensation is referred to herein as
"time evolution compensation." This time scaling compensation has the significant
advantage of restoring the entire audio stream, including the transient, to its original
relative temporal position. Thus, the likelihood of audibility of the time scaling
processes is greatly reduced, although not eliminated, because the two time scaling
processes themselves may cause audible artifacts.

In order to provide optimum time-evolution compensation, various
information such as the location of the transient, the location of the block ends, the
length of the transient time shifting, and the length of the pre-noise is useful. The
length of the pre-noise is useful in assuring that the time-scaling of the time evolution
compensation does not occur during the pre-noise, thus possibly expanding the
temporal length of the pre-noise. The length of the transient time shifting is useful if
it is desired to restore the audio stream to its original relative temporal position and to
maintain the number of samples constant. The location of the transient is useful
because the length of the pre-noise may be determined from the original location of
the transient with respect to the ends of the coding blocks. The length of the pre-
noise may be estimated by measuring a signal parameter, such as high-frequency
content, or a default value may be employed. If the compensation is performed in the
decoder or after decoding, useful information may be sent by the encoder as metadata
along with the encoded audio. When performed after decoding, metadata may be
sent to the compensation process from the decoder (which information may have
originated in the encoder and/or the decoder).

As mentioned above, post-processing to reduce the length of the pre-noise
artifact may also be applied as an additional step to an audio coder that performs time
scaling pre-processing and, optionally, provides metadata information. Such post-
processing would act as an additional quality improvement scheme by reducing the
pre-noise that may still remain after pre-processing.

Pre-processing may be preferred in coder systems employing professional
encoders in which cost, complexity and time-delay are relatively immaterial in
comparison to post-processing in connection with a decoder, which is typically a
lower complexity consumer device.

The low bit rate audio coding system quality improvement technique of the
present invention may be implemented using any suitable time-scaling technique, as
well as any that may become available in the future. One suitable technique is
described in International Patent Application PCT/US02/04317, filed February 12,
2002, entitled High Quality Time-Scaling and Pitch-Scaling of Audio Signals. Said
application designates the United States and other entities. The application is hereby
incorporated by reference in its entirety. As discussed above, since time scaling and
pitch shifting are dual methods of one another, time scaling may also be implemented
using any suitable pitch scaling technique, as well as any that may become available
in the future. Pitch scaling followed by reading out the audio samples at an
appropriate rate that is different than the input sample rate results in a time scaled
version of the audio with the same spectral content or pitch of the original audio and
is applicable to the present invention.

As discussed in the low bit rate audio coding background summary, the
selection of block length in an audio coding system is a trade-off between frequency
and temporal resolution. In general, a longer block length is preferred as it provides
increased efficiency of the coder (generally provides greater perceived audio quality
with a reduced number of data bits) in comparison to a shorter block length.
However, transient signals and the pre-noise signals that they generate offset the
quality gain of longer block lengths by introducing audible impairments. It is for this
reason that block switching or fixed smaller block lengths are used in practical
applications of low bit rate audio coders. However, applying time scaling
pre-processing in accordance with the present invention to audio data that is to undergo
low bit rate audio coding and/or has undergone post-processing may reduce the
duration of transient pre-noise. This allows longer audio coding block lengths to be
used, thereby providing increased coding efficiency and improving perceived audio
quality without adaptively switching block lengths. However, the reduction of
pre-noise in accordance with the present invention may also be employed in coding
systems that employ block length switching. In such systems, some pre-noise may
exist even for the smallest window size. The larger the window, the longer and,
consequently, more audible the pre-noise is. Typical transients provide
approximately 5 msec of premasking, which translates to 240 samples at a 48 kHz
sampling rate. If a window is larger than 256 samples, which is common in a block
switching arrangement, the invention provides some benefit.
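
The sample count quoted above follows directly from the premasking duration and the sampling rate. As a quick check, here is a minimal Python sketch; the function name is illustrative and does not come from the specification.

```python
def premasking_samples(premask_ms: float, sample_rate_hz: float) -> int:
    """Convert a premasking duration in milliseconds to a whole number of samples."""
    return int(premask_ms * sample_rate_hz / 1000.0)


# 5 msec of premasking at a 48 kHz sampling rate
print(premasking_samples(5.0, 48000.0))  # 240
```

On this estimate, pre-noise that starts more than about 240 samples ahead of a transient at 48 kHz would no longer be covered by premasking.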

Audio Coding Transient Pre-Noise Artifacts 
FIGS. 1a-1e show examples of transient pre-noise artifacts generated by a
fixed block length audio coder system. FIG. 1a shows six, 50% overlapped, audio
coding windowed blocks of fixed length, 1 through 6. In this figure and all other
figures herein, each window is contiguous with an audio coding block and is referred
to as a "windowed block," "window," or "block." In this figure and certain other
figures herein, the windows are shown generally in the shape of a Kaiser-Bessel
window. Other figures show windows in the shape of semi-circles for simplicity in
presentation. Window shape is not critical to the present invention. While the length
of the windowed blocks in FIG. 1a and other figures is not critical to the invention,
fixed length windowed blocks typically are in the range of 256 to 2048 samples in
length. The four audio signal examples in FIGS. 1b through 1e illustrate,
respectively, the effects of temporal relationships between the audio coding
windowed blocks and the transient pre-noise artifacts.

FIG. 1b illustrates the relationship between the location of a transient signal in
an input audio stream to be coded and the borders of the 50% overlapping windowed
blocks. While a 50% overlapping fixed block length is shown, the invention is
applicable to both fixed and variable block length coding systems and to blocks
having other than a 50% overlap, including no overlap, as is discussed below in
connection with FIGS. 2a through 5b.

FIG. 1c shows the audio signal stream output of the audio coding system for
the case of an audio signal stream input as shown in FIG. 1b. As shown in FIGS. 1b
and 1c, the transient is located between the end of windowed block 3 and the end of
windowed block 4. FIG. 1c illustrates the location and length of transient pre-noise
introduced by the low bit rate audio coding process in relation to the location of the
transient and the end of windowed block 2. Note that the pre-noise is prior to the
transient and is limited to windowed blocks 4 and 5, the sample blocks in which the
transient lies. Thus, the pre-noise extends back to the beginning of windowed block
4.

Similarly to FIGS. 1b and 1c, FIGS. 1d and 1e show, respectively, the
relationship between an input audio signal stream that contains a transient located
between the end of windowed block 2 and the end of windowed block 3 and the
pre-noise introduced in the output audio signal stream by the audio coding system.
Because the pre-noise is limited to windowed blocks 3 and 4, within which the
transient lies, the pre-noise extends back to the beginning of windowed block 3. In
this case, the pre-noise has a longer duration because the transient is nearer the end of
windowed block 3 than the transient of FIGS. 1b and 1c is to the end of windowed
block 4. The ideal transient location is closely following the last block end so that
the pre-noise extends back only to the next prior block end (about half of the block
length in the case of this 50% block overlap example).
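
The dependence of pre-noise duration on transient position can be sketched numerically. The following Python fragment assumes an idealized model that the text does not state in this form: blocks of `block_len` samples start every `hop` samples from sample 0, and pre-noise begins at the start of the earliest block that contains the transient. Window tapering is ignored, and the function name and example numbers are illustrative only.

```python
def pre_noise_length(transient: int, block_len: int, hop: int) -> int:
    """Samples of pre-noise ahead of `transient` under the idealized model.

    Blocks cover [k*hop, k*hop + block_len) for k = 0, 1, 2, ...
    Pre-noise is taken to start at the beginning of the earliest block
    that contains the transient.
    """
    first_block = max(0, (transient - block_len) // hop + 1)
    return transient - first_block * hop


# 50% overlap: 1024-sample blocks advancing 512 samples per block,
# so block ends fall at 1024, 1536, 2048, ...
print(pre_noise_length(1030, 1024, 512))  # 518: just after a block end
print(pre_noise_length(1530, 1024, 512))  # 1018: just before the next block end
```

In this model a transient just after a block end produces pre-noise of roughly half a block, while one just before a block end produces pre-noise of nearly a full block.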

It should be noted that the examples in FIGS. 1a-1e do not explicitly take into
account the effects of cross fading at the coding window boundaries. In general, as
the audio coding windows taper off, the pre-noise artifacts are scaled accordingly and
their audibility is reduced. For simplicity in presentation, scaling of the pre-noise
artifacts is not shown in the idealized waveforms of the figures herein.

As suggested in FIGS. 1a-1e and shown in more detail in FIGS. 2a, 2b, 3a,
3b, 4a, 4b, 5a and 5b, an audio coder's transient pre-noise artifacts may be
minimized if transient signals are judiciously positioned prior to audio
encoding.

Examples of repositioning the location of a transient in order to reduce
pre-noise are shown in FIGS. 2a, 2b, 3a, 3b, 4a, 4b, 5a and 5b for the cases of
non-overlapping blocks (FIGS. 2a and 2b), less than 50% block overlap (FIGS. 3a and
3b), 50% block overlap (FIGS. 4a and 4b), and greater than 50% block overlap
(FIGS. 5a and 5b). In each case, unless the original position of the transient is
equidistant between two successive block ends (in which case there is no preference),
it is preferred to shift the transient to a position closely following the nearest block
end. Whether the shift is to the prior block end or to the next block end, whether or
not the nearest block end, the resulting pre-noise is substantially the same. However,
by temporally shifting the transient to a location closely following the nearest block
end, disruption to the time evolution of the audio stream is minimized, thereby
minimizing the possible audibility of shifting the transient. Nevertheless, in some
cases, shifting to the more distant block end may also be inaudible. Moreover, even
if a shift to the more distant block end is audible, time evolution compensation, as
described below, may be employed to reduce or eliminate such audibility.
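
The choice of shift direction described above can be sketched as follows. This is a minimal Python illustration under assumed conventions, not the method of the specification: block ends are taken to fall at `k*hop + block_len`, and `guard` is a hypothetical small offset that places the transient just after the block end rather than exactly on it.

```python
def plan_shift(transient: int, block_len: int, hop: int, guard: int = 0):
    """Return (target_position, shift) toward the nearest block end.

    A negative shift moves the transient earlier ("left"); a positive
    shift moves it later ("right").
    """
    if transient < block_len:
        raise ValueError("transient precedes the first block end")
    k = (transient - block_len) // hop
    prev_end = k * hop + block_len
    next_end = prev_end + hop
    # Equidistant positions have no preferred direction; this sketch
    # arbitrarily picks the earlier block end in that case.
    if transient - prev_end <= next_end - transient:
        nearest = prev_end
    else:
        nearest = next_end
    target = nearest + guard
    return target, target - transient


# 1024-sample blocks with 50% overlap: block ends at 1024, 1536, ...
print(plan_shift(1100, 1024, 512))  # (1024, -76)
print(plan_shift(1500, 1024, 512))  # (1536, 36)
```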

FIGS. 2a and 2b show a series of idealized non-overlapping windowed blocks.
In FIG. 2a, a transient's initial location is, as shown by the solid-lined arrow, closer to
the last window end than it is to the next window end. The pre-noise for the
transient's initial location extends back in time to the beginning of the
window, as shown. If it is desired to minimize the degree of temporal shift of the
transient, it should be shifted "left" (backward in time) to a location closely following
the end of the last windowed block, as shown. Although the resulting pre-noise still
extends back to the beginning of the windowed block, this length is very short
compared to the pre-noise resulting from the initial transient location. In this and
other figures, the distance of the shifted transient from the windowed block end is
exaggerated for clarity in presentation. In FIG. 2b, the initial position of the transient
is closer to the next window end than to the previous window end. Thus, if it is
desired to minimize the degree of temporal shift of the transient, it should be shifted
"right" (later in time) to a location closely following the end of the next windowed
block, as shown. It will be noted that the improvement in pre-noise reduction
increases as the initial transient position becomes later in the windowed block.

FIGS. 3a and 3b show a series of idealized windowed blocks that overlap by
less than 50%. In FIG. 3a, a transient's initial location is, as shown by the solid-lined
arrow, closer to the last window end than it is to the next window end. The pre-noise
for the transient's initial location extends back in time to the beginning of
the window, as shown. If it is desired to minimize the degree of temporal shift of the
transient, it should be shifted "left" to a location closely following the end of the last
windowed block, as shown. The resulting pre-noise still extends back to the
beginning of the windowed block, but this length is short compared to the pre-noise
resulting from the initial transient location. In FIG. 3b, the initial position of the
transient is closer to the next window end than to the previous window end. Thus, if
it is desired to minimize the degree of temporal shift of the transient, it should be
shifted "right" to a location closely following the end of the next windowed block, as
shown. It will be noted that the improvement in pre-noise reduction increases as the
initial transient position is later in the interval between successive windowed block
ends.

FIGS. 4a and 4b show a series of idealized windowed blocks that overlap by
50%. In FIG. 4a, a transient's initial location is, as shown by the solid-lined arrow,
closer to the last window end than it is to the next window end. The pre-noise for the
transient's initial location extends back in time to the beginning of the
window, as shown. If it is desired to minimize the degree of temporal shift of the
transient, it should be shifted "left" to a location closely following the end of the last
windowed block, as shown. The resulting pre-noise still extends back to the
beginning of the windowed block, but this length is shorter than the pre-noise
resulting from the initial transient location. In FIG. 4b, the initial position of the
transient is closer to the next window end than to the previous window end. Thus, if
it is desired to minimize the degree of temporal shift of the transient, it should be
shifted "right" to a location closely following the end of the next windowed block, as
shown. It will be noted that the improvement in pre-noise reduction increases as the
initial transient position is later in the interval between successive windowed block
ends, as in the case of less than 50% overlapped blocks.

FIGS. 5a and 5b show a series of idealized windowed blocks that overlap by
greater than 50%. In FIG. 5a, a transient's initial location is, as shown by the
solid-lined arrow, closer to the last window end than it is to the next window end. The
pre-noise for the transient's initial location extends back in time to the
beginning of the window, as shown. If it is desired to minimize the degree of
temporal shift of the transient, it should be shifted "left" to a location closely
following the end of the last windowed block, as shown. The resulting pre-noise still
extends back to the beginning of the windowed block, but this length is still
somewhat shorter than the pre-noise resulting from the initial transient location. In
FIG. 5b, the initial position of the transient is closer to the next window end than to
the previous window end. Thus, if it is desired to minimize the degree of temporal
shift of the transient, it should be shifted "right" to a location closely following the
end of the next windowed block, as shown. It will be noted that the improvement in
pre-noise reduction increases as the initial transient position is later in the interval
between successive windowed block ends, as in the case of 50% overlapped blocks.

It will be noted that the improvement in pre-noise reduction is greatest for
non-overlapping blocks and decreases as the degree of block overlap increases.
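
The trend across overlap amounts can be checked with a small numerical sketch. It rests on an idealized model assumed here for illustration and not stated in this form in the text: blocks of `block_len` samples start every `hop` samples, pre-noise starts at the beginning of the earliest block containing the transient, and window tapering is ignored. All names are illustrative.

```python
def pre_noise_length(transient, block_len, hop):
    """Pre-noise samples before `transient` for blocks [k*hop, k*hop + block_len)."""
    first_block = max(0, (transient - block_len) // hop + 1)
    return transient - first_block * hop


def best_and_worst(block_len, hop):
    """Shortest and longest pre-noise over one interval between block ends."""
    end = 3 * hop + block_len  # a block end well away from the stream start
    lengths = [pre_noise_length(t, block_len, hop) for t in range(end, end + hop)]
    return min(lengths), max(lengths)


for overlap, hop in (("none", 1024), ("25%", 768), ("50%", 512), ("75%", 256)):
    best, worst = best_and_worst(1024, hop)
    print(overlap, best, worst, worst - best)
```

Under these assumptions, with 1024-sample blocks, the largest achievable reduction is 1023 samples for no overlap, 767 for 25% overlap, 511 for 50% overlap and 255 for 75% overlap, which matches the trend noted above.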

DESCRIPTION OF THE DRAWINGS

FIGS. 1a-1e are a series of idealized waveforms illustrating examples of
transient pre-noise artifacts generated by a fixed block length audio coder system for
two cases of input signal conditions.

FIGS. 2a and 2b show a series of idealized non-overlapping windowed blocks
illustrating initial and shifted transient temporal locations, along with the pre-noise
for such locations, for the case of an initial position closer to the last window end
than to the next window end and for the case of an initial position closer to the next
window end than to the previous window end, respectively.

FIGS. 3a and 3b show a series of idealized less than 50% overlapping
windowed blocks illustrating initial and shifted transient temporal locations, along
with the pre-noise for such locations, for the case of an initial position closer to the
last window end than to the next window end and for the case of an initial position
closer to the next window end than to the previous window end, respectively.

FIGS. 4a and 4b show a series of idealized 50% overlapping windowed blocks
illustrating initial and shifted transient temporal locations, along with the pre-noise
for such locations, for the case of an initial position closer to the last window end
than to the next window end and for the case of an initial position closer to the next
window end than to the previous window end, respectively.

FIGS. 5a and 5b show a series of idealized greater than 50% overlapping
windowed blocks illustrating initial and shifted transient temporal locations, along
with the pre-noise for such locations, for the case of an initial position closer to the
last window end than to the next window end and for the case of an initial position
closer to the next window end than to the previous window end, respectively.

FIG. 6 is a flow chart showing steps to reduce transient pre-noise artifacts by
time scaling prior to low bit rate encoding.

FIG. 7 is a conceptual representation of an input data buffer used for transient
detection.

FIGS. 8a-8e are a series of idealized waveforms illustrating an example of
audio time scaling pre-processing in accordance with aspects of the present invention
when a transient exists in an audio coding block and is located closer to the last
windowed block end than to the next windowed block end.

FIGS. 9a-9e are a series of idealized waveforms illustrating an example of
audio time scaling processing when a transient exists in a windowed audio coding
block and is located approximately T samples before a block end.

FIGS. 10a-10d are a series of idealized waveforms illustrating time scaling for
the case of multiple transients.

FIGS. 11a-11f are a series of idealized waveforms illustrating intelligent time
evolution compensation of time scaling using metadata conveyed in the audio stream.

FIG. 12 is a flow chart of time scaling post-processing in conjunction with a
low bit rate audio decoder.

FIGS. 13a-13c are a series of idealized waveforms illustrating an example of
post-processing for a single transient to reduce the pre-noise artifacts present after
decoding.

FIG. 14 is a flow chart of a post-processing process for improving the
perceived quality of audio that has undergone low bit rate coding without time
scaling pre-processing.

FIGS. 15a-15c are a series of idealized waveforms demonstrating the
technique of using a default value to time-scale the audio before each transient to
reduce pre-noise without performing sample number compensation.

FIGS. 16a-16c are a series of idealized waveforms demonstrating the
technique of using a computed pre-noise duration to time-scale the audio before each
transient to reduce pre-noise duration with sample number and time evolution
compensation.

BEST MODE FOR CARRYING OUT THE INVENTION 

Time Scaling Pre-Processing Overview

FIG. 6 is a flow chart illustrating a method for time-scaling audio prior to low
bit rate audio encoding to reduce the amount of transient pre-noise (i.e.,
"pre-processing"). This method processes the input audio in N sample blocks, where N
may correspond to a number greater than or equal to the number of audio samples
used in the audio coding block. Processing sizes with N greater than the size of the
audio coding block may be desirable to provide additional audio data outside of the
audio coding block for use in time scaling processing. This additional data may be
used, for example, to sample number compensate for time scaling processing
performed to improve the location of a transient.

The first step 202 in the process of FIG. 6 checks for the availability of N
audio data samples for time scaling processing. These audio data samples may be
from, for example, a file on a PC-based hard disk or a data buffer in a hardware
device. The audio data may also be provided by a low bit rate audio coding process
that invokes the time scaling processor prior to audio encoding. If N audio data
samples are available they are passed (step 204) to and then used by the time scaling
pre-processing process in the following steps.

The third step 206 in the pre-processing process is detecting the location of
audio data transient signals that are likely to introduce pre-noise artifacts. Many
different processes are available to perform this function and the specific
implementation is not critical as long as it provides accurate detection of transient
signals that are likely to introduce pre-noise artifacts. Many audio coding processes
perform audio signal transient detection and this step may be skipped if the audio
coding process provides the transient information to the subsequent time scaling
processing block 210 along with the input audio data.

Transient Detection 

One suitable method for performing audio signal transient detection is as
follows. The first step in the transient detection analysis is to filter the input data
(treating the data samples as a time function). The input data may, for example, be
filtered with a 2nd order IIR high-pass filter with a 3 dB cutoff frequency of
approximately 8 kHz. The filter characteristics are not critical. This filtered data is
then used in the transient analysis. Filtering the input data isolates the high
frequency transients and makes them easier to identify. Next, the filtered input data
are processed in sixty-four sub-blocks (in the case of a 4096 sample signal sample
block) of approximately 1.5 msec (or 64 samples at 44.1 kHz) as shown in FIG. 7.
While the actual size of the processing sub-block is not constrained to 1.5 msec and
may vary, this size provides a good trade-off between real-time processing
requirements (as larger block sizes require less processing overhead) and resolution
of transient location (smaller blocks provide more detailed information on the
location of transients). The use of 4096 sample signal sample blocks and the use of
64 sample sub-blocks is merely an example and is not critical to the invention.
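
The filtering and sub-block peak extraction described above might be sketched as follows. The text specifies only a 2nd order IIR high-pass filter with a 3 dB cutoff near 8 kHz and notes that the filter characteristics are not critical, so the particular design below (a widely used biquad high-pass formulation with Q = 1/sqrt(2), i.e. a Butterworth response) is an assumption chosen for illustration, as are all function and parameter names.

```python
import math


def highpass_biquad(x, cutoff_hz=8000.0, fs=44100.0, q=1.0 / math.sqrt(2.0)):
    """Filter `x` with a 2nd order IIR high-pass (biquad, direct form I)."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b0 = (1.0 + cos_w0) / 2.0 / a0
    b1 = -(1.0 + cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def subblock_peaks(x, sub_len=64):
    """Maximum absolute value in each complete `sub_len`-sample sub-block."""
    return [
        max(abs(v) for v in x[i:i + sub_len])
        for i in range(0, len(x) - sub_len + 1, sub_len)
    ]


# A 4096-sample buffer yields sixty-four 64-sample sub-block peaks.
buffer = [0.0] * 4096
buffer[2000] = 1.0  # a single click as a stand-in for a transient
peaks = subblock_peaks(highpass_biquad(buffer))
print(len(peaks), peaks.index(max(peaks)))
```

The click at sample 2000 falls in sub-block 31 (samples 1984 to 2047), which is where the largest filtered peak appears.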

The next step of transient detection processing is to perform a low-pass
filtering of the maximum absolute data values contained in each 64-sample
sub-block. This processing is performed to smooth the maximum absolute data and
provide a general indication of the average peak values in the input buffer to which
the actual sub-buffer peak value can be compared. The method described below is
one method of doing the smoothing.

To smooth the data, each 64-sample sub-block is scanned for the maximum
absolute data signal value. The maximum absolute data signal value is then used to
compute a smoothed, moving average peak value. The filtered, high frequency
moving averages for each kth sub-buffer, hi_mavg(k) respectively, are computed
using Equations 1 and 2.

for buffer k = 1:1:64
    hi_mavg(k) = hi_mavg(k-1) + ((hi freq peak val in buffer k) - hi_mavg(k-1)) * AVG_WHT    (1)
end

where hi_mavg(0) is set equal to hi_mavg(64) from the previous input buffer for
continuous processing. In the current implementation the parameter AVG_WHT is
set equal to 0.25. This value was decided upon following experimental analysis
using a wide range of common audio material.
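
The recursion of Equation 1 can be written out directly. The following is a minimal Python sketch; the function and argument names are illustrative, and `prev_avg` plays the role of hi_mavg(0), carried over from the previous input buffer.

```python
AVG_WHT = 0.25  # weighting value given in the text


def smooth_peaks(peaks, prev_avg, avg_wht=AVG_WHT):
    """Moving average of sub-block peaks following Equation 1.

    Returns one smoothed value per sub-block; the last value would be
    carried into the next buffer as its starting average.
    """
    averages = []
    avg = prev_avg
    for peak in peaks:
        avg = avg + (peak - avg) * avg_wht
        averages.append(avg)
    return averages


print(smooth_peaks([1.0, 1.0], 0.0))  # [0.25, 0.4375]
```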

Next, the transient detection processing compares the peak in each sub-block
to the array of smoothed, moving average peak values to determine whether a
transient exists. While a number of methods exist to compare these two measures, the
approach outlined below was taken because it allows tuning of the comparison by use
of a scaling factor that has been set to perform optimally as determined by analyzing
a wide range of audio signals.

The peak value in the kth sub-block, for the filtered data, is multiplied by the
high frequency scaling value HI_FREQ_SCALE, and compared to the computed
smoothed, moving average peak value of each k. If a sub-block's scaled peak value
is greater than the moving average value a transient is flagged as being present.
These comparisons are outlined below in Equations 3 and 4.

for buffer k = 1:1:64
    if (((hi freq peak value in buffer k) * HI_FREQ_SCALE) > hi_mavg(k))    (2)
        flag high frequency transient in sub-block k = TRUE
    end
end
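
The comparison above might be sketched in Python as follows. The text does not give a value for HI_FREQ_SCALE, so the 0.5 used in the example is purely a placeholder, and the function and variable names are illustrative.

```python
def flag_transients(peaks, averages, hi_freq_scale):
    """Flag sub-block k when its scaled peak exceeds the moving average for k."""
    return [peak * hi_freq_scale > avg for peak, avg in zip(peaks, averages)]


# Sub-block peaks with one sudden jump, and the moving averages that a
# 0.25-weighted smoothing starting from 1.0 would produce for them.
peaks = [1.0, 1.0, 10.0, 1.0]
averages = [1.0, 1.0, 3.25, 2.6875]
print(flag_transients(peaks, averages, hi_freq_scale=0.5))
# [False, False, True, False]
```

Because the moving average for sub-block k already includes that sub-block's own peak, a scaling value below 1 can still trigger a flag when the peak rises sharply relative to its recent history.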

Following transient detection, several corrective checks are made to determine
whether the transient flag for a 64-sample sub-block should be cancelled (reset from
TRUE to FALSE). These checks are performed to reduce false transient detections.
First, if the high frequency peak values fall below a minimum peak value then the
transient is cancelled (to address low level transients). Second, if the peak in a
sub-block triggers a transient but is not significantly larger than the peak in the
previous sub-block, which also would have triggered a transient flag, then the
transient in the current sub-block is cancelled. This reduces smearing of the
information on the location of a transient.
Referring again to FIG. 6, the next step 208 in processing is to
determine whether transients exist in the current N sample input data array. If no
transients exist, the input data may be output (or passed back to a low-bit rate audio
coder) with no time scaling processing performed. If transients do exist, the number
of transients that exist in the current N samples of audio data and their location(s) are
passed to the audio time scaling processing portion 210 of the process for temporal
modification of the input audio data. The result of suitable time-scale processing is
described in connection with the description of FIGS. 8a-8e. Note that the process
requires information from the encoder as to, for example, the location of the
windowed sample blocks with respect to the audio data stream. If, optionally, time
scaling metadata information is output (as shown in FIG. 6), for the case of no
transients it would indicate that no pre-processing was performed. Time scaling
metadata may include, for example, time scaling parameters such as the location and
amount of time scaling performed and, if cross fading of spliced audio segments is
employed by the time scaling technique, the cross fade length. Metadata in the
encoded audio bit stream may also include information about transients, including
their location before and/or after temporal shifting. Audio data is output in
step 212.
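
The overall flow of FIG. 6 might be outlined in code as follows. This is only a structural sketch: the detector and time scaler are passed in as callables, the metadata is represented as a plain dictionary, and every name and field shown is hypothetical rather than taken from the specification.

```python
def preprocess(blocks, detect_transients, time_scale):
    """Yield (audio, metadata) for each N-sample block, following FIG. 6."""
    for samples in blocks:                                # steps 202 and 204
        transients = detect_transients(samples)           # step 206
        if not transients:                                # step 208: no transients
            yield samples, {"time_scaled": False}
            continue
        scaled, params = time_scale(samples, transients)  # step 210
        metadata = {"time_scaled": True, "transients": list(transients)}
        metadata.update(params)
        yield scaled, metadata                            # step 212


# Stand-in callables, for illustration only.
def toy_detector(samples):
    return [i for i, v in enumerate(samples) if abs(v) > 0.9]


def toy_scaler(samples, transients):
    # A real time scaler would move the transient; this stub leaves audio as is.
    return samples, {"location": transients[0], "amount": 0}


stream = [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
for audio, meta in preprocess(stream, toy_detector, toy_scaler):
    print(meta)
```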

Audio Pre-Processing 
FIGS. 8a-8e illustrate an example of audio time scaling pre-processing in
accordance with aspects of the present invention when a transient exists in an audio
coding block and is located closer to the last windowed block end than to the next
windowed block end. For this example, a 50% block overlap is assumed, in the
manner of FIGS. 1a-1e and FIGS. 4a and 4b. As discussed previously, to reduce the
amount of transient pre-noise introduced by low bit rate audio coding, it is desired to
adjust the time evolution of the input audio signal such that the audio signal transient
is located closely following the last windowed block end. Such a shift in the
transient location is preferred because it minimizes the disruption to the time
evolution of the signal stream while optimally limiting the length of the transient
pre-noise. However, as discussed above, a shift to a location closely following the next
windowed block end also optimally limits the length of the transient pre-noise but
does not minimize disruption to the signal stream's time evolution. In some cases the
difference in disruption may be of little or no audible significance, particularly if time
evolution compensation is also employed. Thus, a shift to either of the closest block
ends is contemplated by the present invention in the present example and in other
examples herein. As mentioned above, the transient time shifting time scaling need
not be accomplished within a single block unless the processing is performed after
the audio signal stream is divided into blocks by the encoder.

FIG. 8a shows three consecutive 50% overlapped windowed coding blocks.
FIG. 8b shows the relationship between the original input audio data stream,
containing a single transient, and the windowed audio coding blocks. The onset of
the transient is T samples after the preceding block end. Because the transient is
closer to the preceding block end than the next block end, it is preferred to shift the
transient to the left to a location closely following the preceding block end by
applying time compression that has the effect of deleting T samples prior to the
transient. FIG. 8c shows two regions in the audio stream where audio time scaling
may be performed. The first region corresponds to the audio samples before the
transient where reducing the duration of the audio by T samples "slides" or shifts the
position of the transient left to the desired location closely following the end of the
preceding block by providing time compression. As in FIGS. 2a through 5b and
other figures to be described, the spacing of the transient from the block end in FIGS.
8d and 8e is exaggerated in the figure for clarity of presentation. The second region
shows the region where time scaling optionally may be performed after the transient
to increase the duration of the audio by T samples by providing time expansion so
that the overall length of the audio data remains at N samples. Although the deletion
of T samples and the optional sample number compensating addition of T samples
are both shown as occurring within a windowed audio coding sample block, this is
not essential: the compensating time-scaling processing need not occur within a
single audio coding block unless the transient time shifting is performed after the
audio signal stream is divided into blocks by the encoder. The optimum location for
such time-scaling processing may be determined by the time-scaling process
employed. Because the transient may provide useful post-masking, sample number
compensating time scaling preferably is done close to the transient.

FIG. 8d demonstrates the resulting signal stream if time scaling processing is
performed on the input audio data stream by reducing the time duration of the audio
input data stream by T samples in the area before the transient and no sample number
compensating time scale expansion is performed after the transient signal. As
discussed previously, slight variations in the temporal evolution of an audio signal
are not discernible to most listeners. Therefore, if it is not required for the number of
time scaled audio data stream samples to equal the number of input samples, N, it
may be sufficient only to process the audio stream before the transient. FIG. 8e
illustrates the case when the audio data stream before the transient is reduced in
duration by T samples and the audio data stream following the transient is increased
by T samples, thereby maintaining N audio samples in and out of the time scaling
processing block and restoring the time evolution of the audio signal stream except
for the transient and portions of the signal stream close to the transient. The
variations in lengths of the signal waveforms in FIGS. 8b-8e are intended to show
schematically that the number of samples in the audio data stream varies for the
described conditions. When the number of audio samples is reduced, as in FIG. 8d,
additional samples may need to be acquired before additional audio coding can be
performed. This may mean reading more samples from a file or waiting for more
audio to be buffered in a real-time system.
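
The sample bookkeeping of FIGS. 8d and 8e can be illustrated with a deliberately crude stand-in for time scaling. The sketch below stretches or squeezes each region by linear-interpolation resampling, which changes pitch and is therefore not the kind of time scaling the invention contemplates; it is used only to show how the transient position and the total sample count behave. All names and the example numbers are illustrative.

```python
def resample_linear(x, new_len):
    """Stretch or squeeze `x` to `new_len` samples by linear interpolation."""
    n = len(x)
    if new_len <= 0 or n == 0:
        return []
    if n == 1 or new_len == 1:
        return [x[0]] * new_len
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(x[lo] * (1.0 - frac) + x[hi] * frac)
    return out


def shift_transient(x, transient, target, compensate=True):
    """Move the sample at index `transient` to index `target`.

    The audio before the transient is rescaled to `target` samples. With
    `compensate`, the audio from the transient onward is rescaled so that
    the total length is unchanged (as in FIG. 8e); without it, the output
    length changes by the amount of the shift (as in FIG. 8d).
    """
    if transient <= 0 or target <= 0:
        raise ValueError("transient and target must be positive indices")
    pre, post = x[:transient], x[transient:]
    new_pre = resample_linear(pre, target)
    if compensate:
        new_post = resample_linear(post, len(x) - target)
    else:
        new_post = list(post)
    return new_pre + new_post


n = 1024
audio = [0.0] * n
audio[700] = 1.0  # transient onset, T = 180 samples after the desired location
same_len = shift_transient(audio, 700, 520)
shorter = shift_transient(audio, 700, 520, compensate=False)
print(len(same_len), len(shorter))  # 1024 844
```

The same function covers a shift to the right, as in FIGS. 9a-9e, by choosing a target later than the original transient position.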

FIGS. 9a-9e illustrate an example of audio time scaling processing when a
transient exists in a windowed audio coding block and is located approximately T
samples before a block end. To reduce the amount of transient pre-noise introduced
by low bit rate audio coding while minimizing the transient shift, it is preferred to
temporally adjust the input audio signal such that the audio signal transient closely
follows the next block end. In the case of 50% overlapped blocks, a shift to a
location closely following the next block end (or the previous block end) limits the
transient pre-noise to the first half of an audio coding block, instead of spreading
the transient pre-noise throughout that block and the previous audio block.

FIG. 9a shows three consecutive 50% overlapped windowed coding
blocks. FIG. 9b shows the relationship between the original input audio data,
containing a single transient, and the audio blocks. The onset of the transient is T
samples before the next block end. Because the transient is closer to the next block
end than the previous block end, it is preferred to shift the transient to the right to a
location closely following the next block end by applying time expansion that has the
effect of adding T samples prior to the transient. FIG. 9c shows two regions where
audio time scaling may be performed. The first region corresponds to the audio
samples before the transient where increasing the duration of the audio by T samples
slides the position of the transient to the desired location closely after the next block
end. FIG. 9c also shows the region where time scaling may be performed after the
transient to reduce the duration of the audio by T samples so that the overall length of
the audio data stream, N samples, remains constant. FIG. 9d demonstrates the result
if time scaling processing is performed on the input audio data stream by increasing
the time duration of the audio input data stream by T samples in the time region
before the transient but without performing sample number compensating time scale
compression after the transient signal. As discussed previously, slight variations in the
temporal evolution of an audio signal are not discernible to most listeners.
Therefore, if it is not required for the number of audio stream samples after time
scaling to equal the input, N, it may be sufficient only to process the audio before
the transient.

FIG. 9e illustrates the case when the audio prior to the transient is increased in
duration by T samples and the audio following the transient is reduced by T samples,
thereby maintaining a constant number of audio samples before and after time
scaling. As in other figures, the spacing of the transient from the block end in FIGS.
9d and 9e is exaggerated for clarity of presentation.
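
By way of illustration only, the choice between shifting the transient forward
or backward, described above in connection with FIGS. 9a-9e, may be sketched in
Python as follows. The sketch assumes 50% overlapped blocks with a block end every
N/2 samples and sample index 0 falling on a block end. The names ShiftPlan and
plan_shift are hypothetical and are not part of the described system. For
simplicity the transient is placed exactly at the block end, whereas a practical
implementation may place it a small margin after the block end.

```python
from dataclasses import dataclass


@dataclass
class ShiftPlan:
    before_transient: int  # samples added (+) or removed (-) before the transient
    after_transient: int   # compensating change after the transient (0 if none)


def plan_shift(transient_pos: int, block_len: int, compensate: bool = True) -> ShiftPlan:
    """Choose the shorter shift that places the transient at a block end.

    Assumes 50% overlapped blocks (a block ends every block_len // 2 samples)
    and that sample index 0 coincides with a block end.
    """
    hop = block_len // 2
    from_prev_end = transient_pos % hop        # distance after the previous block end
    to_next_end = (hop - from_prev_end) % hop  # T: distance before the next block end
    if to_next_end <= from_prev_end:
        delta = to_next_end      # time expand: the transient moves to the right
    else:
        delta = -from_prev_end   # time compress: the transient moves to the left
    return ShiftPlan(delta, -delta if compensate else 0)
```

For a 1024-sample block and a transient at sample 900, the sketch returns an
expansion of 124 samples before the transient and, when sample number compensation
is requested, a reduction of 124 samples after it.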

Audio Time Scaling Processing for Multiple Transients 

Depending upon the audio coding block size and the content of
the audio data being coded, it is possible for an input audio data stream being
processed to contain, within the N samples being processed, more than one transient
signal that may introduce pre-noise artifacts. As mentioned above, the N samples
being processed may include more than one audio coding block.

FIGS. 10a-10d illustrate processing solutions when two transients occur in an
audio coding block. In general, two or more transients may be handled in the same
manner as a single transient, with the earliest transient in the audio data stream being
treated as the transient of interest.

FIG. 10a shows three consecutive 50% overlapped windowed coding blocks.
FIG. 10b shows the case where two transients in the input audio straddle the end of
an audio coding block. For this case, the earlier transient introduces the most
perceptible pre-noise because a portion of the pre-noise resulting from the second
transient is post-masked by the first transient. To minimize the pre-noise artifacts,
the input audio signal may be time scaled to shift the first transient to the right such
that the audio before the first transient is time scale expanded by T samples, where T
is the number of samples that places the first transient at a position closely
following the next block end.

In order to sample number compensate for the time scale expansion processing
before the first transient in FIG. 10b and to optimize post-masking of the pre-noise
resulting from the second transient by moving the transients more closely together in
time, the audio following the first transient and before the second transient preferably
is time scaled to be reduced in duration by T samples. As illustrated in FIG. 10b,
there is sufficient audio processing data between the first and second transients to
perform time scale processing. However, in some cases it may be that the second
transient is so close to the first transient that there is not enough audio data to
perform time scale processing between them. The amount of audio data required
between transients is dependent upon the time scaling process used for the
processing. If insufficient audio data exists between the two transients, it may be
necessary to time scale compress the audio data following the second transient in order
to provide sample number compensation. In order to accomplish compression of the
audio data after the second transient, it may be necessary for the time scaling process to
have access to a larger segment of audio data than the number of samples in a block
used in the audio coding process, as mentioned above.

FIG. 10c illustrates the case when the first transient is closer to the last block
end than the next block end and all of the transients (in this case, two) are sufficiently
close together that the pre-noise resulting from the second transient is substantially
post-masked by the first transient. Thus, the audio stream prior to the first transient
preferably is time scale compressed by T samples so that the first transient is shifted
to a location just after the prior block end. Sample number compensation to restore
the original number of samples, in the form of time scale expansion, may be
performed in the audio data stream following the second transient.

FIG. 10d illustrates the case when the first transient is closer to the next block
end than the last block end and all of the transients (in this case, two) are sufficiently
close together that the pre-noise resulting from the second is substantially post-
masked by the first transient. Thus, the audio stream prior to the first transient is
time scale expanded by T samples so that the first transient is shifted to a location
just after the next block end. Sample number compensation, in the form of time scale
compression, optionally may be performed in the audio data stream following the
second transient.

For the multiple transient case, if it is desired to time evolution compensate for
pre-processing in a near perfect manner, metadata information may be conveyed with
each coded audio block in a manner similar to the single transient case described
above.
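
As a non-limiting sketch of the multiple transient handling of FIGS. 10a-10d,
the following Python function treats the earliest transient as the transient of
interest and decides where sample number compensation may be applied. The function
name, the returned dictionary keys and the min_gap parameter (the minimum amount of
audio that the chosen time scaling process is assumed to need between transients)
are hypothetical. As before, 50% overlapped blocks with sample index 0 on a block
end are assumed.

```python
def plan_multiple_transients(transients, block_len, min_gap):
    """Plan the shift of the earliest transient and where to compensate for it."""
    first = min(transients)
    later = sorted(t for t in transients if t > first)
    hop = block_len // 2
    from_prev_end = first % hop
    to_next_end = (hop - from_prev_end) % hop
    # Positive delta: expand before the first transient; negative: compress.
    delta = to_next_end if to_next_end <= from_prev_end else -from_prev_end

    # Compressing between the first and second transients (FIG. 10b) also moves
    # them closer together, but needs enough audio between them.
    gap = later[0] - first if later else 0
    if delta > 0 and gap >= min_gap and gap > delta:
        region = "between_first_and_second"
    else:
        region = "after_last_transient"
    return {
        "before_first": delta,
        "compensation": -delta,
        "compensation_region": region,
    }
```

With a 1024-sample block, transients at samples 900 and 1500 and a min_gap of
256 samples, the compensation falls between the two transients. If the second
transient is instead at sample 950, the compensation is deferred until after the
last transient.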

Metadata Controlled Time Evolution Compensation 
of Time Scaling Pre-Processing 

As mentioned above, it may be desirable to apply, subsequent to inverse
transformation by the decoder, a compensating time scaling to the audio signal
stream after the transient such that the time evolution of the processed audio signal
stream is substantially the same as that of the original audio signal stream, thus
restoring the original time evolution of the signal stream. However, experimental
studies have shown that slight temporal modifications of audio are not perceptible to
most listeners and therefore time evolution compensation may not be necessary.
Also, on average, transients are advanced and retarded equally and, thus, over a
sufficiently long time period, the cumulative effect without time evolution
compensation may be negligible. Another issue to be considered is that depending
upon the type of time scaling used for pre-processing, the additional time evolution
compensating processing may introduce audible artifacts in the audio. Such artifacts
may arise because time scaling processing, in many cases, is not a perfectly
reversible process. In other words, reducing audio by a fixed amount using a time
scaling process and then time expanding the same audio later may introduce audible
artifacts.

One benefit of processing audio that contains transient material by time scaling
is that time scaling artifacts may be masked by the temporal masking properties of
transient signals. An audio transient provides both forward and backward temporal
masking. Transient audio material "masks" audible material both before and after
the transient such that the audio directly preceding and following is not perceptible to
a listener. Pre-masking has been measured and is relatively short and lasts only a
few milliseconds while post-masking may last longer than 100 msec. Therefore,
time-scaling time evolution compensation processing may be inaudible due to
temporal post-masking effects. Thus, if performed, it is advantageous to perform
time evolution compensation time-scaling within temporally masked regions.
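
To illustrate how the temporally masked region around a transient might be
expressed in samples, the following Python sketch converts masking durations to a
sample range. The function name is hypothetical, and the default durations of 5
msec of pre-masking and 100 msec of post-masking are assumptions taken from the
approximate figures given in this description rather than measured values.

```python
def masked_region(transient_pos, sample_rate, pre_ms=5.0, post_ms=100.0):
    """Return (start, end) sample indices of the assumed masked region.

    start is clamped at 0; end is exclusive and is not clamped to the
    stream length, which the caller is assumed to know.
    """
    pre = round(sample_rate * pre_ms / 1000)
    post = round(sample_rate * post_ms / 1000)
    return max(0, transient_pos - pre), transient_pos + post
```

At 48 kHz the assumed region spans 240 samples before the transient and 4800
samples after it, so compensating time scaling is far easier to hide after the
transient than before it.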

FIGS. 11a-11f show an example where intelligent time evolution
compensation is performed following inverse transformation in the decoder using
metadata information. The metadata greatly reduces the amount of analysis required
to perform time evolution compensation because it indicates where time scaling
processing should be performed and the duration of time scaling required. As
explained above, time evolution compensating processing is intended to return the
decoded audio signal to its original temporal evolution in which the signal stream,
including the transient, has its original location in the audio stream. FIG. 11a shows
three consecutive 50% overlapped windowed coding blocks. FIG. 11b shows an
input audio stream prior to pre-processing having a transient T samples after a block
end. FIG. 11c shows that the input audio stream is processed by deleting T samples
prior to the transient to shift the transient to an earlier location. T samples are added
after the transient in order to leave the number of audio data samples unchanged
(sample number compensation). FIG. 11d shows the modified audio stream in which
the transient is shifted to an earlier location and audio following the transient is
shifted back to its original location. FIG. 11e shows the required time evolution
compensating time scaling regions in which the deletion of T samples (time
compression) is compensated by adding T samples (time expansion) and the addition
of T samples (time expansion) is compensated by deleting T samples (time
compression). The result, shown in FIG. 11f, is a compensated "near perfect" output
signal having the same time evolution as the input signal of FIG. 11b (subject mainly
to imperfections in the time scaling processes).
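
The description does not prescribe a metadata format. Purely as an assumption
for illustration, each pre-processing operation could be recorded as a region
start, a region length and a signed sample count, from which the decoder-side
compensating operations of FIG. 11e follow by reversing the sign and translating
the region into the coordinates of the pre-processed stream. The ScalingRecord
layout and the compensation_plan function below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRecord:
    start: int   # first sample of the region, in the coordinates of its stream
    length: int  # length of the region before the operation is applied
    delta: int   # samples added (+) or deleted (-) by the operation


def compensation_plan(records):
    """Return operations on the decoded stream that undo the pre-processing."""
    plan = []
    offset = 0  # running shift between input and pre-processed coordinates
    for rec in sorted(records, key=lambda r: r.start):
        # After pre-processing the region has moved by `offset` and its
        # length has changed by rec.delta; the compensation reverses delta.
        plan.append(ScalingRecord(rec.start + offset, rec.length + rec.delta, -rec.delta))
        offset += rec.delta
    return plan
```

For the case of FIG. 11c, a deletion of T samples before the transient followed
by an addition of T samples after it yields a plan that first adds T samples and
then deletes T samples, leaving the stream length unchanged.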

Time Scaling Post-Processing to Reduce Transient Pre-noise 
As demonstrated in a number of previous examples, even with optimal
placement of a transient in an audio coding block, some pre-noise is still introduced
by the low bit rate audio coding system process. As was stated above, longer audio
coding blocks are preferred over shorter coding blocks because they provide greater
frequency resolution and increased coding gain. However, even if transients are
optimally placed by time scaling prior to audio encoding (pre-processing), as the
length of the audio coding block increases, the pre-noise also increases. Pre-masking
of transient temporal pre-noise is on the order of 5 msec (milliseconds), which
corresponds to 240 samples for audio sampled at 48 kHz. This implies that for
coders with block sizes greater than approximately 512 samples, transient pre-noise
begins to be audible even with optimal placement (only half is masked in the case of
50% overlapped blocks). (This does not take into account the reduction of transient
pre-noise caused by windowing edge effects in the coder's blocks.)
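
The arithmetic behind the 240-sample and 512-sample figures may be checked with
the following short Python sketch. The function names are hypothetical, the 5 msec
pre-masking duration is the approximate value stated above, and windowing edge
effects are ignored, as in the text.

```python
def premask_samples(sample_rate, premask_ms=5.0):
    """Number of samples covered by the assumed pre-masking duration."""
    return round(sample_rate * premask_ms / 1000)


def min_prenoise_samples(block_len):
    """Pre-noise remaining with optimal placement and 50% block overlap."""
    return block_len // 2


def prenoise_audible(block_len, sample_rate, premask_ms=5.0):
    """True if the remaining pre-noise outlasts the assumed pre-masking."""
    return min_prenoise_samples(block_len) > premask_samples(sample_rate, premask_ms)
```

At 48 kHz a 512-sample block leaves 256 samples of pre-noise against 240 samples
of pre-masking, so the pre-noise just begins to exceed the masked interval, while a
256-sample block stays within it.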

While transient pre-noise may not be removed entirely from a low bit rate
coding system, it is possible to perform time scaling post-processing (by itself or in
addition to pre-processing) on audio data that has undergone inverse transformation
in a transform-based low bit rate audio decoder to reduce the amount of transient
pre-noise. Time scaling post-processing
may be performed either in conjunction with a low bit rate audio decoder (i.e., as part
of the decoder and/or by receiving metadata from the decoder and/or from the
encoder via the decoder) or as a stand-alone post-process. Using metadata is
preferred because useful information such as the location of transients in relation to
audio coding blocks as well as the audio coding block length(s) is readily available
and may be passed to the post-processing process via the metadata. However, post-
processing may be used without interaction with a low bit rate audio decoder. Both
methods are discussed below.

Time Scaling Post-Processing in Conjunction with 
a Low Bit Rate Audio Decoder (Receiving Metadata)

FIG. 12 is a flowchart of a process for performing time scaling post-processing
in conjunction with a low bit rate audio decoder to reduce transient pre-noise
artifacts. The process illustrated in FIG. 12 assumes that the input data is low bit rate
encoded audio data (step 802). Following decoding of the compressed data into
audio (step 804), the audio corresponding to a block (or blocks) is sent to the time
scaler 806 along with metadata information useful in reducing the transient pre-noise
duration. This information may include, for example, the location of transients, the
length of the audio coder block(s), the relation of the coder block boundaries to the
audio data, and a desired length of the transient pre-noise. If the location of the
transients in relation to the audio coder's block borders is available, the length and
location of the pre-noise artifact may be estimated and accurately reduced by post-
processing. Since transients do provide some temporal pre-masking, it may not be
necessary to completely remove the transient pre-noise. By giving the time scaling
post-processing process a desired pre-noise length, some control over the amount of
pre-noise that is left in the audio output by step 808 may be achieved. The
results of suitable time-scale processing for step 806 are described below in
connection with the description of FIGS. 13a-13c.

Note that post-processing may be useful whether or not pre-processing has
been applied prior to encoding. Regardless of where the transient is located with
respect to block ends, some transient pre-noise exists. For example, at a minimum it
is half the length of the audio coding window for the case of 50% overlap. Large
window sizes still may introduce audible artifacts. By performing post-processing, it
is possible to reduce the length of the pre-noise even more than it was reduced by
optimally placing the transient with respect to block ends prior to quantization by the
encoder.

FIGS. 13a-13c illustrate an example of post-processing for a single transient to
reduce the pre-noise artifact present after inverse transformation. As shown in FIG.
13a, a single transient introduces a pre-noise artifact. Depending on the coding
block length, the pre-noise, even after pre-processing, if any, may last longer
than may be masked by transient temporal pre-masking effects. However, as shown
in FIG. 13b, by using the transient location metadata information from the decoder,
one may identify a region of audio containing the pre-noise in which the pre-noise
may be reduced in length by time scaling the audio to reduce the pre-noise by T
samples. The number T may be chosen such that the pre-noise length is minimized
to take advantage of pre-masking or may be chosen so as to remove the pre-noise
completely or nearly completely. If it is desired to maintain the same number of
samples as in the original signal, the audio following the transient may be time scale
expanded by +T samples. Alternatively, as shown in connection with the example of
FIG. 16a, such sample number compensation may be applied prior to the pre-noise,
which has the advantage of also providing time evolution compensation.
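
The selection of T for post-processing may be sketched as follows, assuming
that the pre-noise length is known (for example, from metadata) and that a desired
residual pre-noise length is supplied. The function name and the dictionary keys
are hypothetical. Only the amounts and locations are computed; the time scaling
itself is left to whatever time scaling process is employed.

```python
def post_process_plan(transient_pos, prenoise_len, desired_len, keep_length=True):
    """Compute how many samples T to remove from the pre-noise region.

    The pre-noise is assumed to occupy the prenoise_len samples that
    immediately precede transient_pos.
    """
    remove = max(0, prenoise_len - desired_len)
    return {
        "prenoise_start": transient_pos - prenoise_len,
        "remove": remove,  # T samples of time compression in the pre-noise
        # Optional expansion by +T after the transient to keep the length.
        "add_after_transient": remove if keep_length else 0,
    }
```

For example, reducing 512 samples of pre-noise to a desired 240 samples gives
T = 272, with an equal expansion after the transient when the original number of
samples is to be maintained.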

It should be noted that if post-processing is performed in conjunction with time
scaling pre-processing, one may minimize the amount of further disruption to the
output audio stream's time evolution. Since the time scaling pre-processing
discussed earlier reduces the length of the pre-noise to N/2 samples for the case of
50% block overlap (where N is the length of the audio coding block), one is
guaranteed to introduce less than N/2 samples of further time evolution disruption in
the output audio as compared to the original input audio. In the absence of pre-
processing, the pre-noise can be up to N samples, the coding block length, for the
case of 50% block overlap.

In some low bit rate audio coding systems, the location of the signal transients
may not be readily available if the encoder does not convey the location information.
If that is the case, the decoder or the time scaling process may, using any number of
transient detection processes or the efficient method described previously, perform
transient detection.

For multiple transients, the same issues apply as for pre-processing, as
discussed above.

Time Scaling Post-Processing
Without Pre-Processing

As mentioned above, in some cases it may be desired to improve the perceived
quality of audio that has undergone low bit rate coding using compression systems
that do not implement transient pre-noise time scaling processing (pre-processing).
FIG. 14 outlines a process for doing that.

The first step 1402 checks for the availability of N audio data samples that
have undergone low bit rate audio encoding and decoding. These audio data samples
may be from a file on a PC-based hard disk or from a data buffer in a hardware
device. If N audio data samples are available, they are passed to the time scaling
post-processing process by step 1404.

The third step 1406 in the time-scaling post-processing process is the
identification of the location of audio data transient signals that are likely to
introduce pre-noise artifacts. Many different processes are available to perform this
function and the specific implementation is not important as long as it provides
accurate detection of transient signals that are likely to introduce pre-noise artifacts.
However, the process described above is an efficient and accurate method that may
be used.

The fourth step 1408 is to determine whether transients exist in the current N
sample input data array as detected by step 1406. If no transients exist, the input data
may be output by step 1414 with no time scaling processing performed. If transients
exist, the number of transients and their location(s) are passed to the transient pre-
noise estimation-processing step 1410 of the process to identify the location and
duration of the transient pre-noise.

The fifth and sixth steps in processing involve estimating the location
and duration of the transient pre-noise artifacts (step 1410) and reducing their length
with time scaling processing (step 1412). Since, by definition, the pre-noise artifacts
are limited to the regions preceding transients in the audio data, the search area is
limited by the information provided by the transient detection processing. As shown
in FIG. 1, the length of the pre-noise is limited from a minimum of N/2 to a maximum
of N samples where N is the number of audio samples in a 50% overlapped audio coding
block. Thus, when N is 1024 samples and audio is sampled at 48 kHz, transient pre-
noise may range from 10.7 msec to 21.3 msec before the onset of the transient,
depending on the transient location in the audio stream, which significantly exceeds
any temporal masking that may be expected from transient signals. Alternatively,
instead of estimating the length of the pre-noise artifacts preceding a transient, step
1410 may assume that the pre-noise artifacts have a default length.
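
The quoted range follows from dividing the minimum and maximum pre-noise
lengths by the sampling rate. Writing f_s for the sampling rate (a symbol
introduced here only for this worked example), with N = 1024 and f_s = 48 kHz:

```latex
\[
t_{\min} = \frac{N/2}{f_s} = \frac{512}{48\,000\ \mathrm{Hz}} \approx 10.7\ \mathrm{ms},
\qquad
t_{\max} = \frac{N}{f_s} = \frac{1024}{48\,000\ \mathrm{Hz}} \approx 21.3\ \mathrm{ms}.
\]
```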

Two approaches for transient pre-noise reduction may be implemented. The
first assumes that all transients contain pre-noise and therefore the audio before every
transient may be time scaled (time compressed) by a predetermined (default) amount
that is based on an expected amount of pre-noise per transient. If this technique is
used, time scale expansion of the audio prior to the temporal pre-noise may be done
to provide both sample number compensation for the time compression time scaling
processing employed to reduce the length of the pre-noise and to provide time
evolution compensation (time expansion prior to the pre-noise that compensates for
time compression within the pre-noise leaves the transient in or nearly in its original
temporal location). However, if the exact location of the start of the pre-noise is not
known, such sample number compensation processing may unintentionally increase
the duration of parts of the pre-noise component.

FIGS. 15a-15c demonstrate a technique that uses a default value to time-scale
the audio before each transient to reduce pre-noise duration but does not perform
sample number compensation. As shown in FIG. 15a, an audio signal stream from a
low bit rate audio decoder has a transient preceded by pre-noise. FIG. 15b shows a
default processing length used as the amount of time compression to be performed by
the time scaling processing. FIG. 15c shows the resulting audio signal stream having
reduced pre-noise. In this example, time evolution compensation is not performed to
return the transient to its original location in the audio data stream. However, in a
manner similar to previous processing examples, if a constant number of input to
output samples is desired, time scale expansion processing following the transient
may be performed, similar to the example of FIG. 13b or, possibly, before the pre-
noise as described below in connection with the example of FIGS. 16a-16c.
However, when applying a default processing length, providing such compensation
prior to the pre-noise runs the risk of performing the time scale expansion processing
within the pre-noise (thus, undesirably increasing the pre-noise length) if the actual
length of the pre-noise exceeds the default length. Moreover, in some cases, the
post-processing may not have access to the audio stream prior to the pre-noise because
the audio may already have been output in order to reduce latency.
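
The default-length technique of FIGS. 15a-15c may be illustrated with the
following Python sketch. The time scaling used here is a simple linear crossfade
splice, substituted purely for illustration; it is not the time scaling process
described elsewhere in this specification. The function names and parameters are
hypothetical, and the audio is assumed to be held in a Python list of floats.

```python
def compress_region(samples, start, length, remove):
    """Shorten samples[start:start+length] by `remove` samples.

    The head of the region is crossfaded into its tail so that the first
    and last samples of the region are preserved.
    """
    if not 0 < remove < length:
        raise ValueError("remove must be between 1 and length - 1")
    kept = length - remove
    region = samples[start:start + length]
    out = []
    for i in range(kept):
        w = i / (kept - 1) if kept > 1 else 1.0  # crossfade weight, 0 to 1
        out.append((1.0 - w) * region[i] + w * region[i + remove])
    return samples[:start] + out + samples[start + length:]


def reduce_prenoise_default(samples, transient_positions, default_len, remove):
    """Compress a default-length region before every transient.

    Transients are handled last to first so that the positions of the
    earlier transients remain valid while the stream is shortened.
    """
    for pos in sorted(transient_positions, reverse=True):
        start = max(0, pos - default_len)
        samples = compress_region(samples, start, pos - start, remove)
    return samples
```

No sample number compensation is performed in this sketch, so each processed
transient moves earlier by the number of samples removed, as in FIG. 15c.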

A second post-processing pre-noise reduction technique, illustrated in FIGS.
16a-16c, involves performing analysis of the pre-noise resulting from a transient to
determine its length and processing the audio so that only the pre-noise segment is
processed. As noted above, transient pre-noise is produced when the high frequency
components of transient audio material are temporally smeared throughout a block as
a result of the quantizing process in the encoder. Therefore one straightforward
method of detection is to high pass filter the audio prior to a transient and measure
the high frequency energy. The start of the transient pre-noise is identified when the
noise-like, high frequency pre-noise related to and caused by the transient exceeds a
predetermined threshold. When the size and location of the transient pre-noise are
known, compensating time scale expansion of the audio may be performed prior to
time scale reduction of the pre-noise to return the audio to its original temporal
evolution, restoring the time evolution of the audio stream substantially to its
original condition. The invention is not limited to employing high frequency
detection. Other techniques for detecting or estimating the length of the pre-noise
may be employed.
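
One possible reading of the high frequency detection described above is
sketched below in Python. A first-difference filter stands in for the high pass
filter, the energy is measured in short frames, and the search walks backwards from
the transient until the energy no longer exceeds the threshold. The function name,
the frame length and the threshold value are assumptions made for illustration
and would need tuning for real decoded audio.

```python
def find_prenoise_start(samples, transient_pos, max_len, frame=32, threshold=1e-4):
    """Estimate the first sample of the pre-noise preceding a transient.

    Returns transient_pos if no frame before the transient exceeds the
    threshold. The estimate is accurate only to within one frame.
    """
    lo = max(1, transient_pos - max_len)
    # First-difference high pass filter: y[n] = x[n] - x[n-1].
    hp = [samples[n] - samples[n - 1] for n in range(lo, transient_pos)]
    start = transient_pos
    end = len(hp)
    # Walk backwards one frame at a time while the mean high frequency
    # energy of the frame stays above the threshold.
    while end - frame >= 0:
        seg = hp[end - frame:end]
        energy = sum(v * v for v in seg) / frame
        if energy <= threshold:
            break
        start = lo + end - frame
        end -= frame
    return start
```

The difference between the transient position and the returned index gives an
estimated pre-noise length that may be used in place of a default length.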

In FIG. 16a, an audio signal stream from a low bit rate audio decoder has a
transient preceded by pre-noise. FIG. 16b shows a time compression processing
length used as the amount of time scale reduction to be performed by the time scaling
processing based on an estimated pre-noise length as measured by the high frequency
audio content in the block. FIG. 16b also shows the use of time expansion by T
samples in order to restore the original time evolution of the signal stream and also to
restore the original number of samples. FIG. 16c shows the resulting audio signal
stream having reduced pre-noise along with the original time evolution and the same
number of samples as the original signal stream.

The present invention and its various aspects may be implemented as software
functions performed in digital signal processors, programmed general-purpose digital
computers, and/or special purpose digital computers. Interfaces between analog and
digital signal streams may be performed in appropriate hardware and/or as functions
in software and/or firmware.

1. A method for reducing distortion artifacts preceding a signal transient in an
audio signal stream processed by a transform-based low-bit-rate audio coding system
employing coding blocks, comprising

detecting a transient in the audio signal stream prior to processing by said
coding system, and

shifting the temporal relationship of said transient with respect to said coding
blocks such that the time duration of said distortion artifacts is reduced.

2. The method of claim 1 wherein said shifting shifts the temporal relationship
of said transient with respect to said coding blocks prior to forward transforming in
the encoder of said coding system.

3. The method of claim 2 wherein said transient is shifted to a temporal
position closely following the next block end or closely following the last block end.

4. The method of claim 3 wherein said transient is shifted to a temporal
position closely following the next block end or closely following the last block end
which results in the shorter shift of temporal position.

5. A method according to claim 1 or claim 3 further comprising removing at
least a portion of remaining distortion artifacts after inverse transformation in the
decoder of said coding system.

6. The method of claim 5 wherein the portion of remaining distortion artifacts
is determined at least in part by metadata information carried in said coding system.

7. The method of claim 5 wherein the portion of remaining distortion artifacts
is determined at least in part by a default parameter.

8. The method of claim 5 wherein the portion of remaining distortion artifacts
is determined at least in part by a measure of high frequency audio components in
said audio signal stream.

9. The method of claim 2 or claim 3 wherein the temporal relationship of said
transient with respect to said coding blocks is shifted by time scaling a segment of
said audio signal stream preceding said signal transient.

10. The method of claim 9 further comprising applying a compensating time
scaling to the audio signal stream subsequent to inverse transformation in the decoder
of said coding system such that the time evolution of the processed audio signal
stream is substantially the same as that of the audio signal stream prior to said
shifting.

11. The method of claim 10 wherein said compensating time scaling is applied
to a segment of said audio signal stream preceding said signal transient.

12. The method of claim 10 wherein said coding system includes an encoder
and a decoder, said encoder transmitting metadata to said decoder along with an
encoded version of said audio signal stream, said metadata including information
useful for applying said compensating time scaling.

13. The method of claim 9 wherein said time scaling is performed on a
segment of said audio stream closely preceding said transient.

14. The method of claim 13 wherein said time scaling is performed on a
segment of said audio stream that is at least partially temporally pre-masked by
transient.

15. The method of claim 9 wherein said time scaling has the effect of deleting
signal components from or adding signal components to the audio signal stream
applied to the coding system.

16. The method of claim 15 wherein a further time scaling is applied
following said signal transient, said further time scaling acting in the opposite sense
to the said first-recited time scaling.

17. The method of claim 16 wherein said further time scaling is applied prior
to forward transforming in the encoder of said coding system.

18. The method of claim 16 wherein said further time scaling is applied
subsequent to inverse transformation in the decoder of said coding system.

19. The method of claim 16 wherein the time duration of the signal
components added or deleted by said further time scaling is substantially the same as
the time duration of signal components deleted or added by said first-recited time
scaling, respectively, whereby the time duration of said audio signal stream is
substantially unchanged.

20. The method of claim 15 further comprising applying compensating time
scaling to the audio signal stream preceding said distortion artifacts, which precede
said transient, and subsequent to inverse transformation in the decoder of said coding
system such that the time evolution of the processed audio signal stream is
substantially the same as that of the audio signal stream prior to said shifting and the
time duration of said audio signal stream is substantially unchanged.

21. The method of claim 20 wherein said coding system includes an encoder
and a decoder, said encoder transmitting metadata to said decoder, said metadata
including information useful for applying said compensating time scalings.

22. The method of claim 9 wherein said audio signal stream applied to the
coding system is a digital signal stream in which the audio information is represented
by samples, the order of said samples representing time, and wherein said time
scaling has the effect of deleting samples from or adding samples to the digital signal
stream applied to the coding system.

23. The method of claim 9 wherein a further time scaling is applied following
said signal transient, said further time scaling acting in the opposite sense to the said
first-recited time scaling.

24. The method of claim 23 wherein said further time scaling is performed on
a segment of said audio stream closely following said transient.

25. The method of claim 24 wherein said time scaling is performed on a
segment of said audio stream that is at least partially temporally post-masked by
transient.

26. The method of claim 23 wherein said first-recited time scaling has the
effect of deleting signal components from or adding signal components to the audio
signal stream applied to the coding system and said further time scaling has the effect
of adding signal components to the audio signal stream when said first-recited time
scaling deletes signal components and said further time scaling has the effect of
deleting signal components from the audio signal stream when said first-recited time
scaling adds signal components.

27. The method of claim 26 wherein the time duration of the signal
components added or deleted by said further time scaling is substantially the same as
the time duration of signal components deleted or added by said first-recited time
scaling, respectively, whereby the time duration of said audio signal stream is
substantially unchanged.

28. The method of claim 23 wherein said audio signal stream applied to the
coding system is a digital signal stream in which the audio information is represented
by samples, the order of said samples representing time, and wherein said first-recited
time scaling has the effect of deleting samples from or adding samples to the
digital signal stream applied to the coding system and said further time scaling has
the effect of adding samples to the digital signal stream when said first-recited time
sampling deletes samples from the digital signal stream and said further time scaling
has the effect of deleting samples from the digital signal stream when said first-recited
time sampling adds samples to the digital signal stream.

29. The method of claim 1 wherein said detecting detects multiple transients
and said shifting shifts the temporal location of the first of said transients to reduce
distortion artifacts prior to the first of said transients.

30. The method of claim 29 wherein the temporal location of the first of said
transients with respect to said coding blocks is shifted by time scaling said audio
signal stream preceding the first of said signal transients.

31. The method of claim 30 wherein a further time scaling is applied
following the first of said transients and before one or more other of said multiple
transients, said further time scaling acting in the opposite sense to the said
first-recited time scaling.

32. The method of claim 30 wherein a further time scaling is applied
following said transients, said further time scaling acting in the opposite sense to the
said first-recited time scaling.

33. In a decoder of a transform-based low-bit-rate audio coding system
employing coding blocks, a method for reducing distortion artifacts preceding a
signal transient in an audio signal stream subsequent to inverse transformation,
comprising

detecting a transient in the audio signal stream, and
time compressing at least a portion of said distortion artifacts such that the
time duration of said distortion artifacts is reduced.

34. The method of claim 33 wherein the portion of the distortion artifacts is
determined at least in part by the location of the detected transient and a default
parameter.

35. The method of claim 33 the portion of the distortion artifacts is
determined at least in part by the location of the detected transient and signal
characteristics preceding said transient.

36. The method of claim 35 wherein said signal characteristics include a
measure of high-frequency components of the audio signal stream.

37. The method of claim 34 or 35 further comprising time expanding prior to
said time compression such that the time evolution and length of the audio signal
stream is substantially unchanged.

38. The method of claim 34 or 35 further comprising time expanding
subsequent to said time compression such that the length of the audio signal stream is
substantially unchanged.

AMENDED CLAIMS
[received by the International Bureau on 04 October 2002 (04.10.02);
original claims 1-38 replaced by amended claims 1-37 (7 pages)]

1. A method for reducing distortion artifacts preceding a signal transient in an 
audio signal stream processed by a transform-based low-bit-rate audio coding system 
employing coding blocks, comprising 

detecting a transient in the audio signal stream prior to processing by said 
coding system, and 

shifting the temporal relationship of said transient with respect to said coding 
blocks by time scaling a segment of said audio signal stream preceding said signal 
transient such that the time duration of said distortion artifacts is reduced.

2. The method of claim 1 wherein said shifting shifts the temporal relationship
of said transient with respect to said coding blocks prior to forward transforming in
the encoder of said coding system.

3. The method of claim 2 wherein said transient is shifted to a temporal 
position closely following the next block end or closely following the last block end. 

4. The method of claim 3 wherein said transient is shifted to a temporal 
position closely following the next block end or closely following the last block end 
which results in the shorter shift of temporal position. 
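
The choice between the next block end and the last block end in claims 3 and 4 can be illustrated with a minimal sketch. It assumes, purely for illustration, that coding blocks end at integer multiples of a fixed block length and that the transient should land a small guard interval after a block end; the names `shift_to_block_boundary`, `block_len` and `guard` are hypothetical and do not come from the claims.

```python
def shift_to_block_boundary(transient_pos: int, block_len: int, guard: int = 0) -> int:
    """Return the signed shift, in samples, that places the transient
    `guard` samples after the nearer block end.

    A negative result moves the transient earlier (samples would be deleted
    before it); a positive result moves it later (samples would be added).
    Assumption: block ends fall at multiples of `block_len`.
    """
    if block_len <= 0 or guard < 0:
        raise ValueError("block_len must be positive and guard non-negative")

    last_end = (transient_pos // block_len) * block_len   # most recent block end
    next_end = last_end + block_len                       # upcoming block end

    shift_to_last = (last_end + guard) - transient_pos
    shift_to_next = (next_end + guard) - transient_pos

    # Claim 4: prefer whichever target needs the shorter shift.
    if abs(shift_to_last) <= abs(shift_to_next):
        return shift_to_last
    return shift_to_next


if __name__ == "__main__":
    print(shift_to_block_boundary(1000, 512, guard=8))  # transient late in a block
    print(shift_to_block_boundary(600, 512, guard=8))   # transient early in a block
```

Real transform coders typically use overlapping windows, so the mapping from a sample index to a block end would be more involved than this integer arithmetic.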

5. A method according to any one of claims 1-4 further comprising removing 
at least a portion of remaining distortion artifacts after inverse transformation in the
decoder of said coding system.

6. The method of claim 5 wherein the portion of remaining distortion artifacts
is determined at least in part by metadata information carried in said coding system.

AMENDED SHEET (ARTICLE 19)

7. The method of claim 5 wherein the portion of remaining distortion artifacts
is determined at least in part by a default parameter.

8. The method of claim 5 wherein the portion of remaining distortion artifacts 
is determined at least in part by a measure of high frequency audio components in
said audio signal stream.

9. The method of claim 1 further comprising applying a compensating time 
scaling to the audio signal stream subsequent to inverse transformation in the decoder
of said coding system such that the time evolution of the processed audio signal
stream is substantially the same as that of the audio signal stream prior to said
shifting.

10. The method of claim 9 wherein said compensating time scaling is applied
to a segment of said audio signal stream preceding said signal transient.

11. The method of claim 9 wherein said coding system includes an encoder
and a decoder, said encoder transmitting metadata to said decoder along with an
encoded version of said audio signal stream, said metadata including information
useful for applying said compensating time scaling.

12. The method of claim 1 wherein said time scaling is performed on a
segment of said audio stream closely preceding said transient.

13. The method of claim 12 wherein said time scaling is performed on a
segment of said audio stream that is at least partially temporally pre-masked by
transient.

14. The method of claim 1 wherein said time scaling has the effect of deleting
signal components from or adding signal components to the audio signal stream
applied to the coding system.

15. The method of claim 14 wherein a further time scaling is applied
following said signal transient, said further time scaling acting in the opposite sense
to the said first-recited time scaling.

16. The method of claim 15 wherein said further time scaling is applied prior
to forward transforming in the encoder of said coding system.

17. The method of claim 15 wherein said further time scaling is applied
subsequent to inverse transformation in the decoder of said coding system.

18. The method of claim 15 wherein the time duration of the signal
components added or deleted by said further time scaling is substantially the same as 
the time duration of signal components deleted or added by said first-recited time 
scaling, respectively, whereby the time duration of said audio signal stream is 
substantially unchanged. 
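
The pairing of a first time scaling before the transient with an opposite-sense scaling after it (claims 14, 15 and 18) can be sketched as follows. This is a deliberately naive illustration, assuming that samples are simply dropped or repeated; a practical time scaler would splice with cross-fades at well-chosen points. The function name and the `hold` parameter (a hypothetical number of samples left untouched right after the transient onset) are assumptions, not terms from the claims.

```python
def shift_transient_keep_length(samples, transient_pos, shift, hold=64):
    """Move the transient by `shift` samples while keeping the total length.

    shift < 0: delete |shift| samples just before the transient, then repeat
               |shift| samples starting `hold` samples after the onset.
    shift > 0: repeat the |shift| samples just before the transient, then
               delete |shift| samples starting `hold` samples after the onset.
    Naive sample dropping/repetition, for illustration only.
    """
    n = abs(shift)
    if n == 0:
        return list(samples)

    pre = list(samples[:transient_pos])    # audio preceding the transient
    post = list(samples[transient_pos:])   # transient and what follows
    if n > len(pre) or hold + n > len(post):
        raise ValueError("not enough samples around the transient")

    if shift < 0:
        pre = pre[:-n]                                             # first scaling deletes
        post = post[:hold + n] + post[hold:hold + n] + post[hold + n:]  # further scaling adds
    else:
        pre = pre + pre[-n:]                                       # first scaling adds
        post = post[:hold] + post[hold + n:]                       # further scaling deletes

    return pre + post


if __name__ == "__main__":
    signal = list(range(100))              # sample values double as original indices
    moved = shift_transient_keep_length(signal, transient_pos=50, shift=-5, hold=10)
    print(len(moved), moved.index(50))     # total length and new transient position
```

Because the number of samples added equals the number deleted, the overall duration is unchanged, while the transient itself lands `shift` samples away from its original position.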

19. The method of claim 14 further comprising applying compensating time
scaling to the audio signal stream preceding said distortion artifacts, which precede
said transient, and subsequent to inverse transformation in the decoder of said coding
system such that the time evolution of the processed audio signal stream is
substantially the same as that of the audio signal stream prior to said shifting and the
time duration of said audio signal stream is substantially unchanged.

20. The method of claim 19 wherein said coding system includes an encoder
and a decoder, said encoder transmitting metadata to said decoder, said metadata
including information useful for applying said compensating time scalings.

21. The method of claim 1 wherein said audio signal stream applied to the
coding system is a digital signal stream in which the audio information is represented
by samples, the order of said samples representing time, and wherein said time
scaling has the effect of deleting samples from or adding samples to the digital signal
stream applied to the coding system.

22. The method of claim 1 wherein a further time scaling is applied following 
said signal transient, said further time scaling acting in the opposite sense to the said 
first-recited time scaling. 

23. The method of claim 22 wherein said further time scaling is performed on
a segment of said audio stream closely following said transient.

24. The method of claim 23 wherein said time scaling is performed on a
segment of said audio stream that is at least partially temporally post-masked by
transient.

25. The method of claim 22 wherein said first-recited time scaling has the 
effect of deleting signal components from or adding signal components to the audio
signal stream applied to the coding system and said further time scaling has the effect
of adding signal components to the audio signal stream when said first-recited time
scaling deletes signal components and said further time scaling has the effect of
deleting signal components to the audio signal stream when said first-recited time
scaling adds signal components.

26. The method of claim 25 wherein the time duration of the signal
components added or deleted by said further time scaling is substantially the same as
the time duration of signal components deleted or added by said first-recited time
scaling, respectively, whereby the time duration of said audio signal stream is
substantially unchanged.

27. The method of claim 22 wherein said audio signal stream applied to the
coding system is a digital signal stream in which the audio information is represented
by samples, the order of said samples representing time, and wherein said first-recited
time scaling has the effect of deleting samples from or adding samples to the
digital signal stream applied to the coding system and said further time scaling has
the effect of adding samples to the digital signal stream when said first-recited time
sampling deletes samples from the digital signal stream and said further time scaling
has the effect of deleting samples from the digital signal stream when said first-recited
time sampling adds samples to the digital signal stream.

28. The method of claim 1 wherein said detecting detects multiple transients 
and said shifting shifts the temporal location of the first of said transients to reduce 
distortion artifacts prior to the first of said transients.

29. The method of claim 28 wherein the temporal location of the first of said
transients with respect to said coding blocks is shifted by time scaling said audio
signal stream preceding the first of said signal transients.

30. The method of claim 29 wherein a further time scaling is applied 
following the first of said transients and before one or more other of said multiple 
transients, said further time scaling acting in the opposite sense to the said
first-recited time scaling.

31. The method of claim 29 wherein a further time scaling is applied
following said transients, said further time scaling acting in the opposite sense to the
said first-recited time scaling.

32. In a decoder of a transform-based low-bit-rate audio coding system
employing coding blocks, a method for reducing distortion artifacts preceding a
signal transient in an audio signal stream subsequent to inverse transformation,
comprising

detecting a transient in the audio signal stream, and 

time compressing at least a portion of said distortion artifacts such that the 
time duration of said distortion artifacts is reduced. 
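
The two steps of claim 32 can be sketched in a few lines, under stated assumptions. The detector below is a simple frame-energy ratio test and the compressor is a linear cross-fade splice; both are stand-ins chosen for brevity, not the techniques of the specification. The names `detect_transient`, `compress_pre_noise`, `frame`, `ratio`, `pre_noise_len` and `remove` are hypothetical; `pre_noise_len` plays the role that a default parameter might play in claim 33.

```python
def detect_transient(samples, frame=32, ratio=8.0):
    """Return the start index of the first frame whose energy exceeds
    `ratio` times the previous frame's energy, or None if there is none.
    Resolution is one frame; a real detector would be more refined."""
    prev_energy = None
    for start in range(0, len(samples) - frame + 1, frame):
        energy = sum(x * x for x in samples[start:start + frame]) + 1e-12
        if prev_energy is not None and energy > ratio * prev_energy:
            return start
        prev_energy = energy
    return None


def compress_pre_noise(samples, transient_pos, pre_noise_len, remove):
    """Shorten the `pre_noise_len` samples before the transient by `remove`
    samples, cross-fading from the start of the segment to its end."""
    if not 0 < remove < pre_noise_len <= transient_pos:
        raise ValueError("need 0 < remove < pre_noise_len <= transient_pos")

    start = transient_pos - pre_noise_len
    seg = samples[start:transient_pos]
    out_len = pre_noise_len - remove

    faded = []
    for i in range(out_len):
        w = i / max(out_len - 1, 1)          # fade weight runs from 0 to 1
        faded.append((1.0 - w) * seg[i] + w * seg[i + remove])

    # The output is `remove` samples shorter; the transient moves earlier.
    return list(samples[:start]) + faded + list(samples[transient_pos:])


if __name__ == "__main__":
    decoded = [0.0] * 256 + [1.0] * 64       # toy decoded signal with a step onset
    pos = detect_transient(decoded)
    shorter = compress_pre_noise(decoded, pos, pre_noise_len=64, remove=16)
    print(pos, len(decoded), len(shorter))
```

The output is shorter than the input by `remove` samples, which is why the later dependent claims add a time expansion before or after the compression to restore the original length.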

33. The method of claim 32 wherein the portion of the distortion artifacts is 
determined at least in part by the location of the detected transient and a default 
parameter. 

34. The method of claim 32 the portion of the distortion artifacts is
determined at least in part by the location of the detected transient and signal
characteristics preceding said transient.

35. The method of claim 34 wherein said signal characteristics include a
measure of high-frequency components of the audio signal stream.

36. The method of claim 33 or 34 further comprising time expanding prior to
said time compression such that the time evolution and length of the audio signal
stream is substantially unchanged.

37. The method of claim 33 or 34 further comprising time expanding
subsequent to said time compression such that the length of the audio signal stream is
substantially unchanged.